Here's a cat - I didn't get a chance to work it into the post.
In the previous post, we were discussing Gradient Descent and why it has so many connections to Phase Spaces (check that out over here, if you haven't yet). There, I had dropped a hint about another Optimization Method we could use. Did you catch it?
When we discussed some of the calculus involved, I mentioned that the derivative, or slope, of a function at a local minimum or maximum is equal to 0. As it turns out, you can simply set the derivative equal to 0 and solve for the parameters you need. This closed-form approach is sometimes referred to as the Normal Equation Method (usually in the context of Linear Regression).
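As a tiny example of the idea: for $f(\theta) = (\theta - 3)^2$, the derivative is $f'(\theta) = 2(\theta - 3)$. Setting that to 0 and solving gives $\theta = 3$, which is exactly where the parabola bottoms out - no iterative descent required.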
I will talk about a case-specific simplification here.
This method is called Least Squares, and it minimizes the following error function, often known as Mean Squared Error (MSE) for reasons that will become apparent in a moment:
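$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$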
This formula may seem complicated, but don't worry: it is actually quite simple. Before we dig into it, though, it is important to realize that the error function you choose changes the kind of results you get. Compare it, for example, to this error function, called Mean Absolute Error (MAE):
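$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$$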
Because of the square, MSE treats differences between yhat and y as much more significant than MAE does, so improving yhat produces a greater reduction in MSE than in MAE.
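To see the effect, suppose a single prediction is off by 3: it contributes $3^2 = 9$ to the sum inside MSE but only $|3| = 3$ to the sum inside MAE. Shrink that miss to 1 and the MSE sum drops by 8, while the MAE sum drops by only 2.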
Anyway, let's understand MSE:
yhat represents the predictions of a Machine Learning (ML) algorithm, while y represents the correct output for each of those cases. Errors like these are often used as part of the evaluation process, to determine how effective an ML algorithm's model is. The sigma (the strange symbol with the i under it and the n above) represents a sum: i is an iterative variable, meaning it goes through each value from i = 1 to i = n, as denoted by the n at the top. Here, n is the number of examples.

In essence, the formula sums up the difference between the prediction and the true output for each example, to produce the total error. Each difference is squared, i.e. raised to the power of two, because a square can never be negative, and this prevents the errors of different examples from cancelling each other out. Finally, the division by n turns the total into an average (hence the "Mean" in Mean Squared Error); it doesn't change which parameters minimize the error, since it only scales everything down, and it happens to make the calculus a little tidier too.
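If formulas aren't your thing, here is the same computation as a minimal Python sketch (the numbers are invented purely for illustration):

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between predictions and true values."""
    n = len(y_true)
    return sum((yhat - y) ** 2 for yhat, y in zip(y_pred, y_true)) / n

# Invented example: three predictions versus three true values
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]
print(mean_squared_error(y_true, y_pred))  # (0.25 + 0.0 + 1.0) / 3 ≈ 0.4167
```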
So now that we have understood MSE, let's understand Least Squares. I won't go through the derivation here, since that involves multivariable calculus, so I will just state the formula as it is, and we will try to understand where it applies and what it does.
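Here it is, with $\bar{x}$ and $\bar{y}$ standing for the mean (average) of the x values and the y values respectively:

$$\theta_1 = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$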
OK, that looks like a lot. But it's not too bad, once you get the context.
In particular, this applies to a parameter in the first Machine Learning algorithm we are going to discuss: Linear Regression. I won't spoil it too much, because we'll tackle that in a future blog post (at least I'll try), but Linear Regression involves fitting a line through the data. Specifically, the equation of that line looks like this:
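$$\hat{y} = \theta_1 x + \theta_0$$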
If you ignore the strange theta symbols, you should recognize this as the plain old y = mx + b formula you probably learned in middle/high school, where m represents the slope, or steepness, of the line, and b represents the y-intercept, i.e. the height at which the line crosses the y-axis.
Here, we replace m with theta1 and b with theta0. As part of the optimization process, as we saw in the Gradient Descent posts (you should really read those if you haven't), we determine the optimal parameters, represented by theta1 and theta0. In particular, the least squares formula above determines the value of theta1.
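To make the formula less abstract, here is a small Python sketch of the whole fit. The data points are invented so the right answer is obvious, and the intercept uses the standard companion rule theta0 = ȳ - theta1·x̄ (not stated in this post, but it's the usual way to recover theta0 once theta1 is known):

```python
def least_squares_line(xs, ys):
    """Closed-form least-squares fit of the line yhat = theta1 * x + theta0."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # theta1: sum of (x - x_mean)(y - y_mean) divided by sum of (x - x_mean)^2
    theta1 = (
        sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        / sum((x - x_mean) ** 2 for x in xs)
    )
    # theta0 via the usual companion rule: the fitted line passes through
    # the point of means (this step is not part of the formula in the post)
    theta0 = y_mean - theta1 * x_mean
    return theta1, theta0

# Invented points that lie exactly on y = 2x + 1, so we expect theta1 = 2, theta0 = 1
print(least_squares_line([1, 2, 3, 4], [3, 5, 7, 9]))  # (2.0, 1.0)
```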
So the difference terms, i.e. $x_i - \bar{x}$ and $y_i - \bar{y}$, are simply measuring how much each value deviates from its overall mean. The statistical import of this is, frankly, largely lost on me too, but notice that summing over every example is what lets least squares take all of the data into account. Nevertheless, I found this oversimplification helpful for gaining intuition:
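In spirit, the big formula is a "rise over run" computed against the means: each example contributes a rise of $y_i - \bar{y}$ and a run of $x_i - \bar{x}$, and the formula blends them all together. Compare it with the two-point slope formula from school:

$$m = \frac{y_2 - y_1}{x_2 - x_1}$$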
Considering that the formula for the slope of a line from back in middle/high school is very similar, and we are finding a slope here, I think this provides sufficient intuition. Nevertheless, if you have any insight into the importance of using a form that is so similar to the variance, please let me know in the comments. In any case, next week (hopefully soon), I hope to start another miniseries on this blog -
I think it will be pretty interesting, though it may initially appear a slight detour from Machine Learning. All the same, keep your eyes peeled!