
Machines, Stereotypes and Bias/Variance

Machine Learning is quite analogous to the way we learn. One of the most difficult things about learning is preconceptions, or, on a larger scale, stereotypes. Holding a simple but incorrect idea in your head makes it much harder to learn something new. ML has not one but two analogous problems, and understanding them is very important for building accurate models.


So, what are stereotypes? One way to look at them is as the result of training on a skewed data set. Let’s take an example. You live in a strange part of the world where the prevalent flower is the rose. You’ve seen a rose a billion times, and you’ve observed a lot about it: it has thorns, a green stalk and red petals. You know a lot about roses. But let’s say that in search of a better job, you set out towards the city. The city seems a bit bewildering, with all the noise and pollution, but the people seem alright, you decide. And then you find the shop. You duck under the huge label that says ‘John’s Floral Arrangements’ and enter. The first bouquet you see seems normal: roses sticking out of the weathered plastic, red and thorny as ever. Then you pass the next bouquet, and the next one.

What kind of weird place is this? There’s something yellow in them. You storm up to the manager and ask him why he’s selling fruits in bouquets. After a weird look, the manager replies that it’s not a fruit but a marigold, a kind of flower. And this makes no sense to you.


Before we see what this has to do with ML, we have to introduce a couple of new terms. An algorithm is trained on a training data set, and once it has formulated the best model it can, by reducing the cost function, we need to test it. After all, we need to see if it works. What engineers tend to do for this is split the data they have collected. Suppose they have collected data about 50 kinds of flowers: they know their widths, heights, stalk colors, thorniness, petal colors, and so on.

They then partition this data into three groups: training, cross-validation and testing. We’ve already discussed the training data set. You can ignore the cross-validation data set for now; we’re just going to use it later to make our models more accurate in a different way. The test data set is data we can use to check how well our model works. Engineers have developed many ways of evaluating this, but we’ll keep that for later as well.

The important idea is this: let’s say we’re trying to predict what number a thrown dice will display. Not as a random thing, but rather, given inputs (known in Machine Learning as features) like the dice’s initial vertical speed, its initial orientation, air resistance, and so on, we try to create a Machine Learning model to predict whether a given dice throw will be a 6 or not, so you can start strategizing early in Monopoly. So let’s say that we have now collected all the data for 50 different throws and fed them (as a training set) to the algorithm to develop a model from. The output for this model is a 1 if the dice rolls a 6, and a 0 otherwise.
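To make that three-way split concrete, here is a minimal sketch in Python (using NumPy). The dice-throw features are made up, and the 60/20/20 split ratio is just one common choice of mine, not a rule:

    import numpy as np

    rng = np.random.default_rng(seed=0)

    # Hypothetical data: 50 recorded throws, each described by 3 features
    # (initial vertical speed, initial orientation, air resistance).
    X = rng.normal(size=(50, 3))
    # Label is 1 if the throw came up 6, and 0 otherwise (roughly 1 in 6 throws).
    y = (rng.random(50) < 1 / 6).astype(int)

    # Shuffle the throws, then carve out 60% / 20% / 20% of them for
    # training, cross-validation and testing respectively.
    order = rng.permutation(len(X))
    train_idx, cv_idx, test_idx = np.split(order, [30, 40])

    X_train, y_train = X[train_idx], y[train_idx]
    X_cv, y_cv = X[cv_idx], y[cv_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    print(len(X_train), len(X_cv), len(X_test))  # 30 10 10

The model only ever sees X_train and y_train while learning; the other two sets are kept aside to judge it afterwards.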

We then test this model on a new data set. And by some calculations, we see that the model is right 90% of the time. That sounds great, and we should be done, right?

Wrong. What we also notice is that the model has never predicted a 6 correctly during testing. Without that, the model is not really serving its purpose, despite its high accuracy. What do we do now?

Examine the model that the algorithm developed. What we might see is that the model takes all its inputs but always predicts 0. In other words, no matter what inputs or features you give it, the model will always predict a number other than 6 for the dice. How could this have such high accuracy, you ask? Well, when you throw a six-sided dice, there’s roughly a 17% chance you’ll get a 6. With some additional impact of the surroundings, it could become only a 10% chance. When your algorithm always says you’ll miss the 6, and most of the time you will, it seems very accurate. But in reality, the model is actually quite dumb, because eventually (it might take a very long time, but still) you will throw a 6, and the model will have never seen it coming. This is a very easy problem to miss, and in Machine Learning it is known as the problem of skewed classes.

There are two kinds of models created by a Machine Learning algorithm: regression and classification. Regression means a real-valued output, or in other words, numbers. It could tell you a predicted stock market price for a company tomorrow, or what would be a reasonable price to sell your house for. Classification, on the other hand, is what it says it is: it classifies data into different set groups, deciding whether a college will admit you or not, or whether a given product will meet its quarterly goal.

There are technically other ways of grouping data, but these lie on the other side of the chasm between supervised and unsupervised learning, and we’re still figuring out supervised learning. Skewed classes are one of the difficulties that plague classification. When you have two options (known as classes), and one is WAYYY more likely to happen than the other, lots of nonsensical predictions will score quite accurately, even though they completely miss the point. This is just like you training on flowers and deciding that roses were the only flower. Even though there probably were other flowers in your village, you weren’t made fun of too much if you didn’t recognize them, since they were so rare, and so you had a pretty high accuracy despite a rather foolish model. When you moved to the city, however, your accuracy plummeted, like running a particularly testing test data set, if you will permit the pun. The important thing is to make sure you train your model on a data set of decent variety - roses, lotuses, marigolds, chrysanthemums, and whatnot - and then make sure your model is sensitive to new inputs.
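Here is a tiny numerical sketch of just how misleading plain accuracy can be in this situation. The labels are simulated with a 10% chance of a 6, matching the rough figure above; nothing here is real dice data:

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Simulate 1000 throws where a 6 comes up only ~10% of the time
    # (the skewed-class situation described above).
    y_true = (rng.random(1000) < 0.10).astype(int)

    # A "model" that ignores its inputs and always predicts "not a 6".
    y_pred = np.zeros_like(y_true)

    accuracy = np.mean(y_pred == y_true)
    caught_sixes = np.mean(y_pred[y_true == 1] == 1)  # fraction of real 6s it caught

    print(f"accuracy: {accuracy:.0%}")       # around 90%
    print(f"6s caught: {caught_sixes:.0%}")  # 0% - it never sees a 6 coming

Despite the impressive-looking accuracy, the second number is the one that reveals the model is useless for its actual purpose.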


Which brings us to the other aspect of human behavior that machines share: bias.


Let’s say that as a kid, a dog bit you, and for fear of rabies, you had to take 14 injections in the stomach. Not a great experience. Since then, you’ve feared dogs. In fact, you’ve hated them. You tremble at the sound of a bark and whimper at the sight of a ball of fur hurtling down the street. But now, you want to work at an NGO, out of some altruism. Unfortunately for you, the only one available is one that treats injured dogs, and you have no choice but to join it. The dogs that come to your NGO are mostly thin, weak and don’t eat much. But they’re very playful, and are always around your knees, licking you and showing other signs of affection. Over time, you decide dogs cannot be bad, and that you must have just been mistaken in your childhood. All dogs must be thin, weak and loving. So when you next go out onto the street, you’re not afraid of the pack of stray dogs that surround you. And then they bite, and you have to take the 14 injections all over again.


After that tale of betrayal, you may be a little saddened. So let’s talk about how to prevent this tragedy from happening to your learning algorithm. What you had here was bias. First, you built a model of the qualities of dogs based on the first dog that bit you: fearsome, loud, hungry and evil (at least in your mind). Then, when you worked at the NGO, you changed your mind about dogs and decided that they were loving, quiet, satiated and gentle. But instead of accounting for both your experiences - seeing dogs bite and dogs lick - you maintained a simplistic model in which dogs are either good or bad. And naturally, you’re not careful when surrounded by a hungry pack of stray dogs, and you’re bitten.

Some Machine Learning algorithms will end up developing such models for applications that need much more nuance. This is called a high bias configuration. What this means is that the algorithm develops a simplistic model, say, a straight line to fit data that is really not in the shape of a line. The name comes from the fact that the algorithm appears to have a bias: it settles on a very simplistic idea which, beyond a point, it is no longer willing to develop or improve to fit the data, much like how you hypothetically refused to maintain a nuanced idea of the character of dogs in general. In order to fix high bias, you can take the following measures:


  1. Take more features or inputs.

     This automatically makes your model more nuanced. In effect, this would be you refusing to come to a conclusion too quickly about dogs, and instead noting both their licks and how angrily they bark at unfamiliar visitors. In the future, we will examine why this is true mathematically, but that’s for later.

  2. Reduce regularization.

     We will examine regularization on its own in the future, but what regularization effectively does is suppress how important your inputs are to what your model looks like. By reducing regularization, you give more importance to your inputs and less to your preconceived notions. This would be like becoming more aware of the canines around you and, instead of trying to see them as either good or bad based on your previous experiences (which here represent a preconceived notion), judging them by how they behave with you now. A short code sketch of both fixes follows this list.
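As a rough illustration of both remedies, here is a sketch using scikit-learn on synthetic curved data; the particular polynomial degrees and regularization strengths (alpha) below are arbitrary choices of mine, picked only to make the effect visible:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(seed=2)

    # Synthetic curved data: a straight line is too simple for it (high bias).
    X = np.linspace(-3, 3, 60).reshape(-1, 1)
    y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.3, size=60)

    def train_error(degree, alpha):
        """Mean squared training error for a given model capacity and regularization."""
        model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                              Ridge(alpha=alpha))
        model.fit(X, y)
        return np.mean((model.predict(X) - y) ** 2)

    # 1. More features: going from a line to a quadratic lets the model bend.
    print(train_error(degree=1, alpha=1.0))       # high: a line can't follow the curve
    print(train_error(degree=2, alpha=1.0))       # much lower

    # 2. Less regularization: a huge alpha keeps the model stubbornly flat,
    #    while a small alpha lets the inputs actually shape the fit.
    print(train_error(degree=2, alpha=100000.0))  # high again: the model barely listens to its inputs
    print(train_error(degree=2, alpha=0.1))       # low

In the dog analogy, the extra polynomial feature is you noticing more about each dog, and the smaller alpha is you letting those observations actually change your mind.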

The opposite of high bias is high variance. This is not good either. To continue with the analogy, in order to have a high variance model of dogs in your mind, you would have to be very detail-oriented. You would end up trying to fit everything into your model, instead of accepting that some parts of your data are just outliers - the dog that first bit you was very afraid of humans, and the dogs you treated at the NGO were far too weak to hurt you anyway. Your model of a dog should not try too hard to account for these things, or else you will end up with an excessively complicated model that doesn’t generalize to real life.
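And here is the mirror image, in the same spirit as the previous sketch (again synthetic data and arbitrary degrees of my choosing): a very flexible model that memorizes its training points almost perfectly but does noticeably worse on points it has never seen.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(seed=3)

    # The same kind of curved data, split into a small training set and a test set.
    X = rng.uniform(-3, 3, size=(40, 1))
    y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.3, size=40)
    X_train, y_train, X_test, y_test = X[:20], y[:20], X[20:], y[20:]

    def errors(degree):
        """Training and test mean squared error for a polynomial of the given degree."""
        model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                              LinearRegression())
        model.fit(X_train, y_train)
        train = np.mean((model.predict(X_train) - y_train) ** 2)
        test = np.mean((model.predict(X_test) - y_test) ** 2)
        return train, test

    print(errors(2))   # modest training error, similar test error: a sensible fit
    print(errors(15))  # near-zero training error, typically much larger test error: high variance

In the dog analogy, the degree-15 model is you insisting that every quirk of every dog you have ever met must be part of the definition of a dog.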

As we expand on the great experiment of machines learning, it’s important for us to understand, regardless of whether their intelligence is anything like ours, that some of our problems have analogs in their world, and it’s best to make use of the millennia of learning experience humans have accumulated to teach machines too.




   
