This is part of my new miniseries on Stochastic Optimization. While this is not taught in a lot of Machine Learning courses, it's an interesting perspective, applicable in an incredible number of fields. Nevertheless, this won't be a very long series, and when we exit it, it'll be time to dive straight into our first Machine Learning algorithm!
Introduction to Optimization:
Ok, so what is Optimization? As the name may suggest, Optimization is about finding the optimal configuration of a particular system. Of course, in the real world, the important question in any such process is this: in what sense?
Ok, so what is Optimization? As the name may suggest, Optimization is about finding the optimal configuration of a particular system. Of course, in the real world, the important question in any such process is this: in what sense?
i.e. By what criteria do you intend to optimize the system?
However, we will not delve too much into that just yet, but I promise, that will bring about a very strong connection to ML.
Introduction to Stochastic Optimization:
So far, as part of our blogposts, we have discussed Gradient Descent and the Normal Equation Method. These are both Optimization algorithms, but they differ significantly from the ones we intend to cover in this miniseries. We can see why in a quick example. Our understanding of Gradient Descent was based on the analogy of trying to find your way to the lowest point available from a hillock.
Technical Note: There are actually three kinds of Gradient Descent, and these differ in their determinism, but that's not very important here.
In the version of Gradient Descent we talked about, if you start at a given point for a given terrain, you MUST reach the same minimum point no matter what. This is known as a deterministic approach, i.e. the goal state is already predetermined once you pick the starting point.
However, at the end of the second post on Gradient Descent, I mentioned a drunk man, staggering down the hill. Given the same starting point on the hill, it isn't necessary that he will always end up in the same place. There's a particular random component to him, and it's not easy to be able to tell where he's going to go next. Of course, there are advantages and disadvantages to that, and we will get to that in a minute, but it should be obvious that this is not a deterministic approach. Instead, this is known as a Stochastic Optimization process. Stochastic means probability or randomness, giving this variant its name: Stochastic Gradient Descent. There are other kinds of Gradient Descent, and I hope you will explore them - or just wait till we get to them.
Advantages and Disadvantages of Stochastic Approaches:
Technical Note: There are actually three kinds of Gradient Descent, and these differ in their determinism, but that's not very important here.
In the version of Gradient Descent we talked about, if you start at a given point for a given terrain, you MUST reach the same minimum point no matter what. This is known as a deterministic approach, i.e. the goal state is already predetermined once you pick the starting point.
However, at the end of the second post on Gradient Descent, I mentioned a drunk man, staggering down the hill. Given the same starting point on the hill, it isn't necessary that he will always end up in the same place. There's a particular random component to him, and it's not easy to be able to tell where he's going to go next. Of course, there are advantages and disadvantages to that, and we will get to that in a minute, but it should be obvious that this is not a deterministic approach. Instead, this is known as a Stochastic Optimization process. Stochastic means probability or randomness, giving this variant its name: Stochastic Gradient Descent. There are other kinds of Gradient Descent, and I hope you will explore them - or just wait till we get to them.
Advantages and Disadvantages of Stochastic Approaches:
Nevertheless, now that we have defined a stochastic optimization approach, let's think about what the advantages and disadvantages of using one could be. Going back to the drunk man, we see that he takes relatively less time to reach the base of the valley, which signifies that stochastic optimization approaches take less time to 'converge' to a reasonably good minimum. On the other hand, since he's inebriated, he will find it much more difficult to accurately pinpoint the exact minimum point, and reach there, suggesting that while stochastic processes give good results in little time, deterministic approaches can be much more accurate.
Those are somewhat obvious differences. However, there's another, very important difference that we need to consider. Before that, though, we need some more background - back to Calculus!
In the second Gradient Descent post (You should really read that, if you haven't already. It will make life a lot easier.), we discussed the concept of local minima in Calculus. In essence, local minima are points around which all the points are greater than itself. Another way to put this, is that you would be able to choose some interval around the point where all other values are greater than its value. We considered its importance, and realized that the optimization approaches we've covered thus far are nothing but ways to find these points. However, there is another kind of minimum. It's called a Global Minimum.
While a graph may have many local minima, however, there is almost always a particular point you would have designated the minimum before Calculus. This point is less than ALL other values of the function, and is known as the Global or Absolute Minimum. Given the context of finding the parameters that minimize the cost function, wouldn't it be nice to be able to find the Global Minima? (Actually, maybe not.)
However, if you have not noticed already, deterministic processes are susceptible to being caught in local minima. That's like the sober man sitting down for a picnic at the lowest spot his path lead him to because walking down a hill over and over for some obscure Machine Learning blog can be tiring. This point may not necessarily be the lowest point in the entire terrain, but it's the lowest one he found, and he's happy with that.
On the other hand, consider the drunk man. Since he didn't really understand the goal of reaching a minimum anyway, he's taking random steps, and sees no reason to stop when he reaches the lowest visible spot. While this may end up taking him some more time, winding through the combinations, he might eventually reach the global minimum after all.
However, if you have not noticed already, deterministic processes are susceptible to being caught in local minima. That's like the sober man sitting down for a picnic at the lowest spot his path lead him to because walking down a hill over and over for some obscure Machine Learning blog can be tiring. This point may not necessarily be the lowest point in the entire terrain, but it's the lowest one he found, and he's happy with that.
On the other hand, consider the drunk man. Since he didn't really understand the goal of reaching a minimum anyway, he's taking random steps, and sees no reason to stop when he reaches the lowest visible spot. While this may end up taking him some more time, winding through the combinations, he might eventually reach the global minimum after all.
This is the unseen advantage of stochastic approaches: they are less susceptible to local minima.
Overfitting:
With all this emphasis on the global minima, one may wonder, however. "Is it really all that great an idea to find the near-perfect model for the data?"
The answer is sometimes no. In one of my first posts, on 'Machines, Stereotypes and Bias/Variance', I discussed this extensively in the second half. Overfitting, also known as Variance, is the condition that arises, when the model you build is too specific to the data itself, and may end up missing the underlying idea in pursuit of the details. This would be like thinking it was necessary to have a weird green pencil to do math well if you're 40, because your math teacher has one - it misses the point. This happens when you fit the model too well to the data, so you want to make sure to retain just the general trend, not the specific details in a model.
This was a short introduction to the topic. Stay tuned for my next blogpost, on the first Stochastic Optimization process: Simulated Annealing!
The answer is sometimes no. In one of my first posts, on 'Machines, Stereotypes and Bias/Variance', I discussed this extensively in the second half. Overfitting, also known as Variance, is the condition that arises, when the model you build is too specific to the data itself, and may end up missing the underlying idea in pursuit of the details. This would be like thinking it was necessary to have a weird green pencil to do math well if you're 40, because your math teacher has one - it misses the point. This happens when you fit the model too well to the data, so you want to make sure to retain just the general trend, not the specific details in a model.
This was a short introduction to the topic. Stay tuned for my next blogpost, on the first Stochastic Optimization process: Simulated Annealing!
Comments