
Phase Spaces 2: Math and Gradient Descent



I'm going to start from where we left off in the last part of this series. If you haven't read that yet, check it out first for a more detailed understanding: we explored what a phase space is, why it's useful, what it has to do with Machine Learning, and more!

I'm assuming you've read the previous article, or that you already know what I talked about there. So, let's get to it.

At the end of the last article, we discovered that it's the power of mathematics that will help us find the values of the parameters that make the cost function lowest. Before we get into what the math does, however, we'll need to define a few things. If you've done Calculus, and in particular partial derivatives, you can skip this section; otherwise, I'd suggest at least a cursory glance. I don't go into too much detail on the subject, but that's only because you won't need it.

Calculus Interlude: Derivatives-

The slope of a graph is a concept you've probably encountered a long time ago, back in school. Calculus formalizes this idea and gives the slope a name: the derivative. You may or may not have encountered it before, but that's OK. Here's what you do need to know:

a.) Simple Derivatives: dy/dx represents the slope of a graph with y on the vertical axis and x on the horizontal axis. It tells you how much y changes for a particular minuscule change in x.

b.) Partial Derivatives: These are somewhat more advanced derivatives, but the point is simply that they measure the change in y based on a change in x, where y depends on many variables/factors, of which x is only one. These are denoted ∂y/∂x.
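For a quick concrete example: if y = 3x² + 2z, then ∂y/∂x = 6x. We treat z as if it were a constant, and ask only how y responds to a tiny change in x, with everything else held fixed.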
An important idea in Calculus is minima and maxima (the plurals of minimum and maximum) - specifically, what are called relative minima and maxima. Take a look at this graph. How many points would you classify as minima?



Calculus would classify 3 of these as relative minima, meaning they are the lowest values within some interval of the graph.


These will be very important, and the really nice thing about relative extrema (maxima or minima) is that the derivative of the function or graph is 0 there. This should make intuitive sense: if you look back at the graph above, you'll notice that around the points you could think of as relative extrema, highlighted in the copy of the graph below, the graph looks flat. In other words, the slope there, represented by the derivative, is 0. This is because

dy/dx = 0

for a flat line - y doesn't change at all, no matter how x changes. That will be very important for our next blog post, so let's remember that!
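For instance, take y = (x - 3)². Its derivative is dy/dx = 2(x - 3), which equals 0 exactly at x = 3 - the bottom of the parabola. Setting the derivative to 0 and solving is precisely how Calculus locates these flat points.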

The Hillock Optimization Problem-

Now that we've covered that, let's actually look at optimization, which is the process of changing the parameters so that you get the best model - that is, the model that makes the fewest mistakes. There are a couple of ways to arrive at the values of the parameters for which the cost is minimum (yes, this is where the Calculus comes in), but the easiest is a technique that's at the basis of Machine Learning: Gradient Descent.

Here's a common analogy. Let's say you're on top of a small hillock in a park, where the surface is very uneven. And let's say that, somehow, you want to get to the lowest point you can, in the least time, while moving in small baby steps. How do you do it?

Here's an intuition of how you could do it, and then we'll actually get into the math - I promise, we will!

The best thing to do first is to look around. When you do, you'll see that the slope you're standing on is steeper on one side and flatter on another. The difference may not be dramatic, but it's there. So when you take a small baby step down the steeper slope, you'll travel further downwards than if you take the same step in the other direction. This ensures that you descend the furthest for a given step size. OK, now that you've moved to your new spot, which direction should you travel next?

If you look around again, you'll still see that one side is most probably steeper and another flatter. Naturally, if you take your step in the steeper direction, you'll descend further. If you keep picking the steeper side, you'll find yourself on the ground before, or at the same time as, anyone else who left the hilltop when you did, because you picked the best path available at every step.

OK, now we'll get to the math. Finally!

So, we're going to use the symbol θ (theta) to refer to your location as you travel down the hill. I'm also going to define another operator here, which will be essential. We'll denote it ':=', and what it means is that you treat the variable as a variable on a computer, rather than a variable in math. In other words, think of it as a memory location, rather than a mysterious value that stores some hidden number. You can update the contents of that memory location - you can change it! OK, that's all we need. To the Math!
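Here's a tiny sketch of the difference in Python (my own illustration):

```python
theta = 5.0        # 'theta' names a memory location currently holding 5.0
theta = theta - 1  # read the old value, subtract 1, store the result back
print(theta)       # prints 4.0 - the 'same' variable now holds a new value
```

In math, "theta = theta - 1" would be a contradiction; as an update to a memory location, it's perfectly sensible - and that's exactly what ':=' will mean in our update rule.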

The Math-

Let's say we have only one feature in a particular machine learning problem. Since there is only one feature, there is only one parameter (also known as a weight), which decides the importance of that feature.
What we want to do now is bring this parameter to the particular value at which the cost is least. The easiest way to see this is to plot a graph of the cost against the parameter. This will help us visualize the point we want to reach:
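If you'd like to draw a picture like this yourself, here's a minimal sketch using a toy cost function of my own choosing (a simple parabola - real cost functions are messier):

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy cost function: J(theta) = (theta - 3)^2, with its minimum at theta = 3
theta = np.linspace(-2, 8, 200)
cost = (theta - 3) ** 2

plt.plot(theta, cost)
plt.xlabel("theta (parameter value)")
plt.ylabel("J(theta) (cost)")
plt.title("Cost plotted against the parameter")
plt.show()
```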


What we are going to do is follow an iterative process: we repeat a particular update over and over, getting closer to the result we want each time. The update rule is:

θ := θ - α · (∂J/∂θ)

where J denotes the cost and θ the parameter.
Let's break this down part-by-part:

The θ on the right side of the := sign denotes the old value of the parameter: we take into account what it was before, and change it based on that. The α represents what's called a 'learning rate' - I'll explain it in just a minute, so don't worry about it yet. The partial derivative can look pretty complex, but in essence it just means the slope at that point. This is really useful, because scaling the step by the slope gives you a significant speed-up. Let's look at two graphs to understand it:


In both graphs, the derivative, or slope, at the starting point is negative. However, the second function has a much steeper negative slope than the first. Since the derivative in the second graph is steeper (more negative), subtracting it increases the value of the parameter by more. This means our parameter moves much further to the right in the second graph than in the first.
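To make the update rule concrete, here's a minimal sketch of gradient descent on the toy parabola from earlier (everything here is my own toy setup):

```python
# Toy cost: J(theta) = (theta - 3)^2, so dJ/dtheta = 2 * (theta - 3)
def grad(theta):
    return 2 * (theta - 3)

theta = 0.0   # arbitrary starting point
alpha = 0.1   # learning rate

for step in range(50):
    theta = theta - alpha * grad(theta)  # theta := theta - alpha * dJ/dtheta

print(theta)  # ~3.0 - it has settled at the minimum
```

Notice that the steps start large (the slope is steep far from the minimum) and shrink automatically as the slope flattens out - exactly the behavior described above.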

The Hillock Optimization Problem(Again)-
Just to go back to the hillock optimization problem:
We can think of the value θ as representing your horizontal location, and the cost function as your height above the ground. Phrased this way, the hillock optimization problem is really just the question: 'how can you get to the minimum value of the cost function?'

So, whichever direction has the steepest slope - in other words, the greatest derivative - is the direction you move in. The steeper the slope, the further you move horizontally that way, so that you take the most advantage of the steep slope and come down as far as you can in the fewest steps, which is the fastest way down. This also ensures that you don't keep travelling in equally long strides regardless of the terrain, which could make you overshoot the minimum when you come to it. In other words, you spend effort (moving horizontally) in proportion to how much you'll be rewarded for it - how far you'll come down.

The learning rate, denoted by α, is analogous to how large your steps are as you move downwards. If you make the learning rate too large, the parameter changes too much - you travel too far forward, and you could miss the minimum. In fact (and we'll go over this in that supplement post), this can cause you to just keep getting further and further from the minimum - usually not what you want. If you make the learning rate too small, you could take forever to make it down to the minimum, because you're imitating an ant or something - OK, the analogy kind of broke down there, but that's alright.
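You can see both failure modes with the same toy loop, just by changing α (again, my own illustrative numbers):

```python
# Same toy problem: J(theta) = (theta - 3)^2, minimum at theta = 3
def run(alpha, steps=20):
    theta = 0.0
    for _ in range(steps):
        theta = theta - alpha * 2 * (theta - 3)
    return theta

print(run(0.1))   # ~2.97 - a reasonable rate converges steadily toward 3
print(run(0.01))  # ~1.0  - too small: barely a third of the way there after 20 steps
print(run(1.1))   # ~-112 - too large: every step overshoots, and it diverges
```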

Congratulations! You have just learned Gradient Descent, a foundational concept in Machine Learning. 
However, Gradient Descent actually looks much more complicated in its applications. This is because there are very few problems that can be solved with one feature and one parameter. Machine Learning problems can have millions of features and millions of parameters, and it quickly becomes impossible to visualize what Gradient Descent is doing. However, the principle (and the math) is the same as here. This is also why we use the partial derivative instead of the ordinary derivative in the update equation: the partial derivative acknowledges that other factors affect the function, but measures only the rate of change between the function and the one variable in question - in this case, J and θ1.
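The same loop carries over to many parameters almost unchanged: θ becomes a vector, and the single derivative becomes the vector of partial derivatives (the gradient). A minimal sketch with a made-up three-parameter cost:

```python
import numpy as np

# Toy cost: J(theta) = sum((theta - target)^2), a bowl in 3 dimensions,
# whose gradient is simply 2 * (theta - target).
target = np.array([3.0, -1.0, 0.5])  # made-up 'ideal' parameter values

theta = np.zeros(3)  # start all parameters at 0
alpha = 0.1

for _ in range(100):
    theta = theta - alpha * 2 * (theta - target)  # same update, vectorized

print(theta)  # approaches [3.0, -1.0, 0.5]
```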

Drunk Descent-

Lastly, let's talk about Stochastic Gradient Descent. When we pictured looking down from the hillock and figuring out the steepest side, we skipped over the fact that it takes time to figure out which path to take. That's not a problem in a 3D world, which basically corresponds to two parameters plus the cost, and it may not take so much time. However, if you had 13,000 features and 13,000 parameters - not an unusual number in Machine Learning - 'looking down' means evaluating the cost over all of your data at every step, and that can take a LONG TIME. You don't want that!

So instead, computer/data scientists use a concept known as Stochastic Gradient Descent. The idea is that you analyze only a small, randomly chosen slice of the data, called a batch or mini-batch, and based on that cursory look at just a handful of training examples, you choose which direction to step in. As you'd expect, each step is not very accurate, but it is a great deal faster to compute. All the same, the process can still converge on the minimum eventually, which, with an acceptable tradeoff in accuracy, is overall much better than standard Gradient Descent. Grant Sanderson, from 3B1B, gives a really good analogy in his series on neural networks (we're covering some of the same ground he does, while trying to build up to a more practical, fuller ability - his YouTube series is incredibly well visualized, so please do check it out!): Stochastic Gradient Descent is like a drunk person coming down the slope versus a sober one. Since the drunk person doesn't deliberate too much, he makes his way down faster, and although he doesn't land exactly on the true minimum easily, he gets within a very reasonable distance of it well before the sober person does.

This does come with an important caution, though: since the guy is drunk, he can very easily make a wrong decision, because he's basing each step on a very small sample, which could be skewed in the wrong direction. This can affect your overall destination, and you may not end up doing very well. There are things you can do to avoid that, but we'll get to them eventually.
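Here's a minimal sketch of the idea for a one-parameter linear model (the data, model, and numbers are all made up for illustration): instead of computing the gradient over all 200 examples at once, each update looks at only a random mini-batch of 16.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake data for a one-feature linear model: y = 2x + noise
X = rng.uniform(-1, 1, size=200)
y = 2 * X + rng.normal(0, 0.1, size=200)

theta = 0.0
alpha = 0.1
batch_size = 16

for epoch in range(20):
    idx = rng.permutation(len(X))  # shuffle the data each pass
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch], y[batch]
        # Gradient of the mean squared error on just this mini-batch -
        # a noisy ('drunk') estimate of the true gradient
        grad = 2 * np.mean((theta * xb - yb) * xb)
        theta = theta - alpha * grad

print(theta)  # ~2.0 - it staggers around a bit, but lands near the true slope
```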

Next time, we'll discuss another method of optimization. In fact, it's based on something we discussed earlier - you'll probably know what I'm talking about. We'll get to that!



Here's the video series by Grant Sanderson from 3Blue1Brown on Neural Networks. It gives some really good intuition, so I would definitely recommend it:
