Introduction:
Now that the strange title has attracted you to the article, StOp stands for Stochastic Optimization. This is the first episode in our mini-series. I've been mulling over this article for months now, which is kind of absurd considering that this is meant to be a quick series, but I apologize for my online dormancy. In the meanwhile, I was working on writing content for a course on Machine Learning. If you're still in school (not college), and you want to learn more, check out:
https://code-4-tomorrow.thinkific.com/courses/machine-learning
At any rate, let's get started.
Expansion and Exploitation:
In some ways, the more of this you read about, the more you begin to think of the world as an array of optimization processes - from the bargain you settle on with the grocer to the conversation you had before you sold your company. But an unfortunate side-effect of this kind of outlook, is that you often become a visibly more selfish person. You spend more time exploiting in your conversations, discussions and dialogues, as you try to spend the least amount of time and derive the maximum benefit. And yet, it is clear with some maturity, that this approach isn't sustainable. If all you think of is your own purpose, your interactions can be frustrating to the people you are speaking to, and naturally, they won't help you for long. Ok, let's leave that thread where it is - don't forget about it though.
Physics is an important part of this blog, so let's think about it for a second. In particular, let's consider the physics of heating and cooling. Blacksmiths often began with liquid metal, at high temperature. As it cools, the the metal solidifies, and becomes a well-structured solid. Physics studies this process in terms of atoms, particles and molecules. When the metal is at a high temperature, particles move around within it, at high speed. As the temperature falls, the particles settle further and further into a stable configuration, until the zero temperature is reached, when they stop moving, in one of the most optimal configurations. That's another thread, so now let's tie them together.
Stochastic Optimization can be used to find optimal configurations - even to sort lists of numbers or determine the ideal cat image to act as clickbait. |
Revisiting Gradient Descent:
Remember when we discussed what Gradient Descent does? Well, let's think about the same process. What we noticed in our intro to this series is that Gradient Descent often stops at local minima, because at that point, the gradient is zero. This, however, may not be the global minimum. So how could we find the global minimum?
Before we actually get to stochastic optimization, a key distinction between calculus-based/gradient-based optimization and stochastic optimization is that the former works on continuous functions. However, the second works on discrete functions or combinatorial problems, only able to take specified values. Since many optimization problems involve continuous functions, you can sample the continuous function at specified intervals to produce a discrete function.
The problem with Gradient Descent is that as it continues to seek lower values, at some point, by the definition of a local minimum, the function values around a local minimum are always greater than its value, so the optimization just stops there. So if we really want to get to a global minimum, we need to make some provision for the pointer to move out of a local minimum in search of a global minimum. But how do we balance between seeking minima and escaping local minima?
The Balance - Stochastic Optimization:
Think back to the conversation we wanted to balance. We need to exploit to some extent - because otherwise your goal isn't being met -, and we also need to make some concession for the other person, to keep them motivated. This kind of thing is referred to in Optimization as explore. So in essence, you need to both exploit and explore. This is the core of Stochastic Optimization, and is what makes it so powerful.
But, you might be asking, what does Stochastic mean? Stochastic = probabilistic, meaning that a stochastic process has some component of randomness and probability to it, and it won't always produce the same results for the same inputs. This is our answer as to how to balance between exploiting and exploring and looking for downward movements, and exploring the function - probability.
Basically, what we are going to do first is to look at a neighboring state. If the state is below the current state, then the choice is easy: go down. But if it is above or equal to it, then we go there only with some probability. Great - that should solve everything, right?
Simulated Annealing:
Well, not quite. You see, if we provide some constant probability by which the optimization algorithm chooses whether or not to explore, the algorithm would never reach the global minimum. Instead, it would keep bouncing all around the function infinitely, and we'd never get anything out of it. So we need a probability that decreases over time, so that the pointer can settle into the global minimum.
Settle into..... That sounds familiar.
A quick ctrl+f should find it for you: "As the temperature falls, the particles settle further and further into a stable configuration, until the zero temperature is reached, when they stop moving, in one of the most optimal configurations."
Now, our first Stochastic Optimization algorithm emerges. It's called Simulated Annealing, named after the process by which blacksmiths solidify metals. It's distinct feature is the presence of what is called a temperature variable. While this doesn't really have anything to do with the real temperature of your room or your computer (although this algorithm can occasionally be computationally expensive, so your computer might heat up if you ran it locally), thinking about the particle metaphor will help you see what it does to the pointer. When your optimization process begins, you have an initial, maximum temperature. This temperature decreases as your pointer moves about, until it reaches the designated final temperature that makes your pointer stop moving. In turn, if you consider that when the temperature of the metal was high, the particles were more motivated to random movement, while a low temperature makes them less likely to take such erratic steps, we can use this to understand what our probability function should be. It should be some function of the temperature and how far above the current point the neighbor is, often denoted as cost - yes, they are related to cost functions.
Probabilities:
I'm not going to go into the exact mathematics of why we derived the probability function in this exact formula (to be frank, I don't know the full way), but here's our probability formula:
We now have a weighted probability, so now we have all the parts in place.
Now, let's put all the pieces together. When implementing Simulated Annealing, there are 3 main functions you need to set up:
1. Neighbors: Generates neighboring conditions for the current state.
For example, if each state is a number n, the relevant neighbors might be n-1, n+1, n-2, n+2. Of course, this all depends on the context. Still, it should be pretty intuitive, because it's just about changing the state slightly in all directions.
2. Cost: Metric for how good a configuration/state is.
In the case of just minimizing /maximizing a real function, the function itself is the cost function. In more innovative applications of Simulated Annealing, you should carefully consider what your cost function is, and how it is calculated.
3. Update: Considers neighboring states, and changes the current state, till a near-ideal state is reached.
Begins with the current state. Temperature begins at some maximum value. Use the Neighbors function to generate neighbors. Out of the possible neighboring states, a random one is selected. Then the Update function makes the following decision:
Calculate cost of neighboring state and cost of current state. If the neighboring state has a lower cost - no questions asked -, the current state is shifted to that neighboring state.
However, if it has a higher cost, we have to bring the probability into picture. If that probability produces a true value, we switch over to this neighboring state, although this becomes less likely as temperature falls. Eventually temperature reaches zero, and the algorithm has produced its final output.
Applications:
Next week, I'll upload some Python code for applications of the algorithm. Even if you don't know Python - or programming concepts - I'll provide a running commentary that will hopefully make it all comprehensible.
Comments