Skip to main content

Stochastic Optimization: NEW MINISERIES!


This is part of my new miniseries on Stochastic Optimization. While this is not taught in a lot of Machine Learning courses, it's an interesting perspective, applicable in an incredible number of fields. Nevertheless, this won't be a very long series, and when we exit it, it'll be time to dive straight into our first Machine Learning algorithm!

Introduction to Optimization:

Ok, so what is Optimization? As the name may suggest, Optimization is about finding the optimal configuration of a particular system. Of course, in the real world, the important question in any such process is this: in what sense?
i.e. By what criteria do you intend to optimize the system?
However, we will not delve too much into that just yet, but I promise, that will bring about a very strong connection to ML.

Introduction to Stochastic Optimization:

So far, as part of our blogposts, we have discussed Gradient Descent and the Normal Equation Method. These are both Optimization algorithms, but they differ significantly from the ones we intend to cover in this miniseries. We can see why in a quick example. Our understanding of Gradient Descent was based on the analogy of trying to find your way to the lowest point available from a hillock. 

Technical Note: There are actually three kinds of Gradient Descent, and these differ in their determinism, but that's not very important here.

In the version of Gradient Descent we talked about, if you start at a given point for a given terrain, you MUST reach the same minimum point no matter what. This is known as a deterministic approach, i.e. the goal state is already predetermined once you pick the starting point. 

However, at the end of the second post on Gradient Descent, I mentioned a drunk man, staggering down the hill. Given the same starting point on the hill, it isn't necessary that he will always end up in the same place. There's a particular random component to him, and it's not easy to be able to tell where he's going to go next. Of course, there are advantages and disadvantages to that, and we will get to that in a minute, but it should be obvious that this is not a deterministic approach. Instead, this is known as a Stochastic Optimization process. Stochastic means probability or randomness, giving this variant its name: Stochastic Gradient Descent. There are other kinds of Gradient Descent, and I hope you will explore them - or just wait till we get to them.

Advantages and Disadvantages of Stochastic Approaches:

Nevertheless, now that we have defined a stochastic optimization approach, let's think about what the advantages and disadvantages of using one could be. Going back to the drunk man, we see that he takes relatively less time to reach the base of the valley, which signifies that stochastic optimization approaches take less time to 'converge' to a reasonably good minimum. On the other hand, since he's inebriated, he will find it much more difficult to accurately pinpoint the exact minimum point, and reach there, suggesting that while stochastic processes give good results in little time, deterministic approaches can be much more accurate. 

Those are somewhat obvious differences. However, there's another, very important difference that we need to consider. Before that, though, we need some more background - back to Calculus!

In the second Gradient Descent post (You should really read that, if you haven't already. It will make life a lot easier.), we discussed the concept of local minima in Calculus. In essence, local minima are points around which all the points are greater than itself. Another way to put this, is that you would be able to choose some interval around the point where all other values are greater than its value. We considered its importance, and realized that the optimization approaches we've covered thus far are nothing but ways to find these points. However, there is another kind of minimum. It's called a Global Minimum.
While a graph may have many local minima, however, there is almost always a particular point you would have designated the minimum before Calculus. This point is less than ALL other values of the function, and is known as the Global or Absolute Minimum. Given the context of finding the parameters that minimize the cost function, wouldn't it be nice to be able to find the Global Minima? (Actually, maybe not.)

However, if you have not noticed already, deterministic processes are susceptible to being caught in local minima. That's like the sober man sitting down for a picnic at the lowest spot his path lead him to because walking down a hill over and over for some obscure Machine Learning blog can be tiring. This point may not necessarily be the lowest point in the entire terrain, but it's the lowest one he found, and he's happy with that. 

On the other hand, consider the drunk man. Since he didn't really understand the goal of reaching a minimum anyway, he's taking random steps, and sees no reason to stop when he reaches the lowest visible spot. While this may end up taking him some more time, winding through the combinations, he might eventually reach the global minimum after all. 

This is the unseen advantage of stochastic approaches: they are less susceptible to local minima. 

Overfitting:

With all this emphasis on the global minima, one may wonder, however. "Is it really all that great an idea to find the near-perfect model for the data?"
The answer is sometimes no. In one of my first posts, on 'Machines, Stereotypes and Bias/Variance', I discussed this extensively in the second half. Overfitting, also known as Variance, is the condition that arises, when the model you build is too specific to the data itself, and may end up missing the underlying idea in pursuit of the details. This would be like thinking it was necessary to have a weird green pencil to do math well if you're 40, because your math teacher has one - it misses the point. This happens when you fit the model too well to the data, so you want to make sure to retain just the general trend, not the specific details in a model. 

This was a short introduction to the topic. Stay tuned for my next blogpost, on the first Stochastic Optimization process: Simulated Annealing!


Comments

Popular posts from this blog

Phase Spaces 1 : Graphs and Geometry

Phase Spaces One of the least heard of, and most interesting techniques of the sciences, that you rarely realize you’ve used before. Phase spaces are symbolic representations of a particular problem, which you can then use to solve it. Let’s start with a simple problem - in physics maybe. Let’s say we have a car, as all good physics problems do. You’re driving at a set initial speed, and a set acceleration. At what time would you have travelled exactly 15 ft? Let’s look at it in terms of "a phase space". I have a velocity-time graph down here:                                                                                                                                ...

StOp 1.1: Anvils, Annealing and Algorithms

Introduction: Now that the strange title has attracted you to the article, StOp stands for Stochastic Optimization. This is the first episode in our mini-series. I've been mulling over this article for months now, which is kind of absurd considering that this is meant to be a quick series, but I apologize for my online dormancy. In the meanwhile, I was working on writing content for a course on Machine Learning. If you're still in school (not college), and you want to learn more, check out:  https://code-4-tomorrow.thinkific.com/courses/machine-learning At any rate, let's get started. Expansion and Exploitation: In some ways, the more of this you read about, the more you begin to think of the world as an array of optimization processes - from the bargain you settle on with the grocer to the conversation you had before you sold your company. But an unfortunate side-effect of this kind of outlook, is that you often become a visibly more selfish person. You spend more time exp...

Phase Spaces 2 : Math and Gradient Descent

I'm going to start from where we left off in the last part of this series. If you haven't read that yet, check that out first if you want a more detailed understanding: We explored what a Phase Space is, why it's useful, what it has to do with Machine Learning, and more!  I'm assuming you've read the previous article, or you know what I talked about there: so let's get to it. At the end of the last article, we discovered that it was the power of mathematics that would help us find the best values of the parameters for the lowest cost function. Before we get into what the Math does, however, we'll need to define some things in the math. If you've done Calculus, and in particular, partial derivatives, you can skip this section, but otherwise I would suggest at least a cursory glance. I don't go into too much detail on the subject, but that's only because you won't need it.  Calculus Interlude: Derivatives- The slope of a graph is a concept you...