To Noobs With Love: Machine Learning: Support Vector Machines
Support Vector Machine (SVM) is a relatively boring machine learning model that is used to classify data into two groups. Once the SVM model has been given some training data, it can assign any new observation to one of the two possible categories.
How Does it Work?
How a basic SVM model works is best understood with a simple example. Let’s imagine starting off by weighing a bunch of cats and dogs. On the number line below, the red crosses are the weights of cats, and the blue crosses are dogs.
Based on these observations, we can draw a threshold (boundary) for the weight on the number line which separates the cats and dogs by weight.
In this case, if we introduce a new observation that has a mass lower than the threshold, we can classify that pet as a cat. Similarly, if the pet’s weight is more than the threshold, we can classify it as a dog.
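Here is a tiny sketch of that rule in Python. The function name and the 8 kg threshold are made up for illustration; they are not taken from the diagram.

```python
def classify_by_weight(weight_kg, threshold_kg):
    """Label a pet 'cat' if it weighs less than the threshold, else 'dog'."""
    return "cat" if weight_kg < threshold_kg else "dog"


# Hypothetical threshold, picked by eye from imaginary observations.
threshold_kg = 8.0

print(classify_by_weight(4.2, threshold_kg))   # cat
print(classify_by_weight(20.0, threshold_kg))  # dog
```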
However, what would happen if we get a new observation very close to our current threshold? Because this observation weighs less than the threshold, our classifier would identify the new pet as a cat.
But this would not make any sense, because the new pet’s weight is much closer to the weights we observed for dogs than to the weights we observed for cats. I guess we can see that the threshold we picked is “totes not on fleek”. How do we come up with a better threshold?
We can go back to our observations and focus on the observations on the edges of each category. Then we can use the midpoint between them as the threshold. The shortest distance between the threshold and the observations is called the margin.
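As a rough sketch, the midpoint and the margin can be computed like this. The weights are invented for the example, and the helper assumes every cat is lighter than every dog, so the two groups do not overlap.

```python
# Made-up weights in kg; not the values from the diagram.
cat_weights = [3.0, 3.5, 4.0, 4.5, 5.0]
dog_weights = [9.0, 12.0, 15.0, 20.0]


def max_margin_threshold(lighter, heavier):
    """Return (threshold, margin) for two non-overlapping 1D groups."""
    edge_light = max(lighter)   # heaviest observation of the lighter group
    edge_heavy = min(heavier)   # lightest observation of the heavier group
    threshold = (edge_light + edge_heavy) / 2
    margin = (edge_heavy - edge_light) / 2
    return threshold, margin


threshold, margin = max_margin_threshold(cat_weights, dog_weights)
print(threshold, margin)  # 7.0 2.0
```

Only the two edge observations (5.0 and 9.0 here) matter; moving any other observation leaves the threshold where it is.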
Let’s imagine getting a new pet whose weight falls on the cat side of the threshold, closer to the cats than to the dogs. In this case, we can classify this new observation as a cat.
Misclassifications and Outliers
On one bright and sunny morning, somebody decides to mess with our classifier and brings in a Chihuahua (very small dog AKA SATAN’S SPAWN).
Now our training data will look like this. We have an observation that was identified as a dog, but is much closer to cats in terms of weight.
Now, if we place the threshold midway between the Chihuahua and the heaviest cat, the margin becomes tiny and the new classifier will be messed up. How do we fix this?
Observations such as very small dogs are known as outliers. In order to make a threshold that would not be sensitive to outliers, we must allow misclassifications. For example, if we kept the threshold at the same position, we would misclassify the Chihuahua as a cat. However, now if we get a new observation of a cat slightly heavier than the Chihuahua, our classifier will still identify that it is a cat.
When we allow misclassifications, the distance between the observations and the threshold is called a Soft Margin. When we use a soft margin to determine the location of a threshold, we are using a support vector classifier to classify observations.
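To get a feel for the trade-off, here is a toy sketch. It is not the real optimisation an SVM library performs: it simply scores candidate thresholds as "margin minus a penalty per misclassified training observation" and keeps the best one. The weights, the `penalty` knob, and the 0.25 kg step are all assumptions made for this example.

```python
# Made-up weights in kg. The 5.5 kg dog plays the role of the Chihuahua.
cat_weights = [3.0, 3.5, 4.0, 4.5, 5.0]
dog_weights = [5.5, 9.0, 12.0, 15.0, 20.0]


def score_threshold(threshold, cats, dogs, penalty):
    """Toy score: margin to the correctly classified points minus a cost per mistake."""
    correct = [w for w in cats if w < threshold] + [w for w in dogs if w >= threshold]
    mistakes = len(cats) + len(dogs) - len(correct)
    margin = min(abs(w - threshold) for w in correct)
    return margin - penalty * mistakes


def pick_threshold(cats, dogs, penalty, step=0.25):
    """Try thresholds on a regular grid and return the best-scoring one."""
    lowest, highest = min(cats + dogs), max(cats + dogs)
    steps = int((highest - lowest) / step)
    candidates = [lowest + step * k for k in range(steps + 1)]
    return max(candidates, key=lambda t: score_threshold(t, cats, dogs, penalty))


# A small penalty tolerates one mistake (the Chihuahua) for a wide margin.
print(pick_threshold(cat_weights, dog_weights, penalty=1.0))   # 7.0
# A large penalty refuses any mistake and squeezes the threshold next to the cats.
print(pick_threshold(cat_weights, dog_weights, penalty=10.0))  # 5.25
```

The `penalty` value plays a similar role to the regularisation setting that real SVM implementations expose, but the scoring rule here is deliberately simplified.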
Two Dimensional Data
Now let’s add height to the observations. The data we now plot will be two dimensional.
When the data is two dimensional, the Support Vector Classifier is a line.
In the above diagram, we can see one solid line and two dotted lines. The solid line is drawn equidistant from one of the cat observations and one of the dog observations. This distance is the margin. The SVM algorithm draws this line in such a way that the sum of the two equal margins is as large as possible.
The two points indicated in the diagram above are known as support vectors. They are called that because they support the boundary: its position is decided by these points alone, so moving one of them would move the line, while moving a point far from the boundary would not. They are called vectors instead of points because in many other cases, an SVM model can have more than two dimensions (independent variables).
The line in the middle is called the Maximum Margin Hyperplane. In a two dimensional plane, this is simply a line. However, it is called a hyperplane in a multi dimensional space. The dotted line on one side of the hyperplane is called the Positive Hyperplane and the one on the other side is called the Negative Hyperplane. The positive and negative hyperplanes give us a sense of where all of the other points are in relation to the boundary.
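For the curious, here is a small sketch of how the three lines relate in two dimensions. It uses the usual textbook convention where the boundary is `w·x + b = 0` and the positive and negative hyperplanes are `w·x + b = 1` and `w·x + b = -1`. The values of `w` and `b` are invented, not fitted to any data, and calling the positive side "dog" is just a choice made for this example.

```python
import math

# Hypothetical boundary: w[0] * weight + w[1] * height + b = 0
w = (3.0, 4.0)
b = -50.0


def decision_value(point):
    """Positive on one side of the boundary, negative on the other."""
    return w[0] * point[0] + w[1] * point[1] + b


def classify(point):
    return "dog" if decision_value(point) >= 0 else "cat"


def distance_to_boundary(point):
    """Perpendicular distance from the point to the maximum margin hyperplane."""
    return abs(decision_value(point)) / math.hypot(w[0], w[1])


def between_hyperplanes(point):
    """True when the point sits strictly between the two dotted lines."""
    return abs(decision_value(point)) < 1


# Distance from the boundary to either dotted line.
margin = 1 / math.hypot(w[0], w[1])

small_pet = (4.0, 2.5)    # (weight, height) in made-up units
large_pet = (20.0, 6.0)
print(classify(small_pet), classify(large_pet))  # cat dog
```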
We can see one observation between the positive and negative hyperplanes. It is an outlier that falls inside the margin, and our classifier has misclassified it as a cat.
Handling Training Data with Overlap
Let’s consider another example where we look at the daily meat consumption of a group of people alongside their health. The diagram below is a number line indicating the amount of meat consumed by a person daily and whether they are healthy or not. The red crosses indicate that they are not healthy, and the green crosses indicate that they are healthy.
As we can see, it’s not healthy to not eat meat, and it’s also not healthy to eat too much meat. PS. I’m not a dietician. I have no idea if this is true in the real world. In this case, we are bound to make a lot of misclassifications no matter where we put a single threshold on the line. How are we going to solve this? This is where Support Vector Machines come into play.
One approach to this is to introduce a Y-axis to the number line. In this case, our new graph will have the amount on the X-axis, and the square of the amount on the Y-axis.
Since each observation now has X and Y coordinates, the data is now two dimensional. This means we can draw a Support Vector Classifier, a straight line, that separates the healthy and unhealthy people.
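Here is a small sketch of why squaring helps. The amounts and labels are invented, and the separating line is chosen by hand rather than found by an SVM, purely to show that a straight line in the new two dimensional space can do what a single threshold could not.

```python
# Made-up daily meat amounts (arbitrary units) and health labels.
amounts = [0, 1, 2, 4, 5, 6, 9, 10, 11]
healthy = [False, False, False, True, True, True, False, False, False]


def lift(amount):
    """Move a 1D observation to 2D: (amount, amount squared)."""
    return (amount, amount ** 2)


# Hand-picked straight line y = slope * x + intercept in the lifted space.
# It crosses the curve y = x^2 at x = low_cut and x = high_cut.
low_cut, high_cut = 3.0, 7.5
slope = low_cut + high_cut
intercept = -low_cut * high_cut


def predict_healthy(amount):
    """Healthy observations fall below the line, unhealthy ones above it."""
    x, y = lift(amount)
    return y < slope * x + intercept


predictions = [predict_healthy(a) for a in amounts]
print(predictions == healthy)  # True
```

The very low and very high amounts both end up above the line, because squaring bends the number line into a curve, and a straight line can slice the bottom off a curve.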
The Support Vector Classifier we created can be used to classify and predict new observations.
When creating a Support Vector Machine model, we follow these steps:
Start with data in a relatively low dimension (our example started at 1 dimension).
Move the data to a higher dimension (we moved from 1 to 2 dimensions).
Find a Support Vector Classifier that separates the higher dimensional data into two groups.
You may be wondering why we squared the amount instead of doing pretty much anything else. The truth is there are many other things we could have done to get to a higher dimension. So how do we decide how to transform the data?
In order to do so, Support Vector Machines use something called a Kernel Function to systematically find Support Vector Classifiers in higher dimensions. In our case, we used a Polynomial Kernel with a degree of 2.
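As a hedged illustration, one common way to write a polynomial kernel for two 1D observations `a` and `b` is `(a * b + r) ** d`. The sketch below assumes `r = 1/2` and `d = 2` (the value of `r` is my assumption; only the degree comes from the example above) and checks that the kernel gives the same number as a dot product between the observations after they have been moved to a higher dimension, without ever doing the move.

```python
import math


def polynomial_kernel(a, b, r=0.5, d=2):
    """Polynomial kernel for two 1D observations."""
    return (a * b + r) ** d


def lift(a):
    """Explicit higher-dimensional coordinates matching r = 1/2, d = 2."""
    return (a, a ** 2, 0.5)


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


a, b = 4.0, 9.0
# (a*b + 1/2)^2 = a*b + a^2 * b^2 + 1/4, which is the dot product of the lifted points.
print(math.isclose(polynomial_kernel(a, b), dot(lift(a), lift(b))))  # True
```

Notice that the first two lifted coordinates are the amount and its square, which is the same transformation we drew by hand with the extra Y-axis.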
I will probably talk about Kernels in another blog. However, another very commonly used Kernel is the Radial Kernel, also known as the Radial Basis Function (RBF) Kernel.
I’m bad at conclusions. That’s it. K Bye!