
Normal Equation Method : A Quick Overview

Here's a cat - I didn't get a chance to work it into the post.

In the previous post, we discussed Gradient Descent and why it has so many connections to Phase Spaces (check that out over here, if you haven't yet). There, I dropped a hint about another Optimization Method we could use. Did you catch it?

When we discussed some of the calculus involved, I mentioned that the derivative, or slope, of a function at a local minimum or maximum is equal to 0. As it turns out, you can simply set the derivative equal to 0 and solve for the required parameters directly. This is sometimes referred to as the Normal Equation Method. I will talk about a case-specific simplification of it here.

This method is called Least Squares, and it minimizes the following error function, often known as Mean Squared Error (MSE), for reasons that will become apparent in a moment:
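$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$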

This formula may seem complicated, but don't worry, it is actually quite simple. Before we unpack it, though, it is important to realize that the error function you choose changes the kind of results you get. Compare it, for example, with this error function, called Mean Absolute Error (MAE):
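$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$$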

Because of the square, MSE makes differences between yhat and y much more significant: when an error is large, improving yhat reduces MSE far more than it reduces MAE, so MSE penalizes big misses more heavily.

Anyway, let's understand MSE:

yhat represents the predictions of a Machine Learning (ML) algorithm, while y represents the correct output for each of these cases. These errors are often used as part of the evaluation process, to determine how effective an ML algorithm's model is. The sigma (that's the strange symbol with the i under it and the n above it) represents a sum. The i is an iterative variable, meaning that it goes through each value from i = 1 to i = n, as denoted by the n at the top; here, n is the number of examples.

In essence, the formula sums up the difference between the prediction and the true output for each example, to produce the total error. Each difference is squared, i.e. raised to the power of two, because a square is never negative, and this prevents the examples' errors from cancelling each other out. Finally, the division by n turns the total into an average (hence the "Mean" in the name) and makes the calculus a little tidier; it doesn't really change the setup, since it only scales the error down.
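To make this concrete, here's a minimal Python sketch (NumPy assumed, numbers made up) that computes both errors for a handful of predictions:

```python
import numpy as np

# Made-up example: correct outputs and a model's predictions
y = np.array([3.0, 5.0, 7.0, 9.0])        # true outputs
y_hat = np.array([2.5, 5.5, 6.0, 9.5])    # predictions

n = len(y)

# Mean Squared Error: average of the squared differences
mse = np.sum((y_hat - y) ** 2) / n

# Mean Absolute Error: average of the absolute differences
mae = np.sum(np.abs(y_hat - y)) / n

print(f"MSE: {mse:.4f}")
print(f"MAE: {mae:.4f}")
```

Notice that the one miss of 1.0 contributes four times as much as a miss of 0.5 to the squared sum, but only twice as much to the absolute sum - exactly the MSE-versus-MAE difference described above.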

So now that we understand MSE, let's understand Least Squares. While I won't be able to show the derivation, since it involves multivariable calculus, I will just state the formula as it is, and we will try to understand where it applies and what it does.
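$$\theta_1 = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

(Here, $\bar{x}$ and $\bar{y}$ are the means of the x and y values.)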


OK, that looks like a lot. But it's not too bad, once you get the context.

In particular, this applies to a parameter in the first Machine Learning algorithm we are going to discuss: Linear Regression. I won't spoil it too much, because we'll tackle that in a future blog post (at least I'll try), but Linear Regression involves fitting a line through the data. Specifically, the equation of that line looks like this:
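$$\hat{y} = \theta_1 x + \theta_0$$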

If you ignore the strange theta symbols, you should recognize this as the plain old y = mx + b formula you probably learned in middle or high school, where m represents the slope or steepness of the line, and b represents the height at which it intersects the y-axis. Here, we replace m with theta1 and b with theta0. As part of the optimization process, as we saw in the Gradient Descent posts (you should really read those if you haven't), we determine the optimal parameters, represented by theta1 and theta0. In particular, the least squares formula above determines the theta1 value.
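Here's a minimal Python sketch of that computation (NumPy assumed, data made up). The post only covers the theta1 formula; the sketch also fills in theta0 using the standard companion result theta0 = y_mean - theta1 * x_mean, so that we end up with a complete line:

```python
import numpy as np

# Made-up toy data: inputs x and correct outputs y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

x_mean, y_mean = x.mean(), y.mean()

# theta1: summed products of the deviations from the means,
# divided by the summed squared x-deviations
theta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# theta0: chosen so the fitted line passes through (x_mean, y_mean)
theta0 = y_mean - theta1 * x_mean

print(f"theta1 (slope):     {theta1:.3f}")
print(f"theta0 (intercept): {theta0:.3f}")
```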

So the difference terms, i.e. $(x_i - \bar{x})$ and $(y_i - \bar{y})$, are simply calculating how much each value deviates from its overall mean. The statistical import of this is, frankly, largely lost on me too, but notice that the summing process allows least squares to take all of the examples into account. Nevertheless, I found this oversimplification helpful for building intuition:
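Loosely speaking, the numerator adds up y-deviations (weighted by x-deviations) while the denominator adds up x-deviations (squared), so the whole thing behaves a lot like the familiar rise-over-run slope formula:

$$m = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1}$$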

Considering that the formula for the slope of a line from back in middle or high school is very similar, and we are indeed finding a slope here, I think this provides sufficient intuition. Nevertheless, if you have any insight into why a form so similar to the variance matters here, please let me know in the comments. In any case, next week (hopefully soon), I hope to start another miniseries on this blog. I think it will be pretty interesting, though it may initially appear to be a slight detour from Machine Learning. All the same, keep your eyes peeled!




