To Noobs with Love: Machine Learning: Simple Linear Regression
Welcome to my blog series on Machine Learning, where I plan to cover everything from the basic concepts to some really advanced stuff (hopefully). As a machine learning noob myself, I will write these posts as I study the concepts, so consider them my own study notes. Let’s start off with Simple Linear Regression.
What is Simple Linear Regression?
You might be familiar with plotting line graphs with an X axis and a Y axis. The quantity on the X axis is often called the “independent variable”, while the quantity on the Y axis is called the “dependent variable”. Simple Linear Regression models the relationship between one independent variable X and one dependent variable Y as a straight line on such a graph.
To explain things in a more formal way, Simple Linear Regression is a statistical method that allows us to summarize the relationship between two variables.
Simple Linear Regression Formula
Regression analysis, the process of finding an equation that fits a particular data set, is a major part of data science. Consider the following chart showing the relationship between the years of experience of a bunch of employees in a company and their salaries.
This type of representation is called a “scatter plot”. Each cross (x) in this diagram represents a single employee, where the X axis represents their years of experience, and the Y axis represents their salary. If we study the graph closely, we can see that the data appears to form a straight line.
When such data appears to form a straight line, we can use Simple Linear Regression to predict the salary of a future employee based on their experience. If you recall the algebra you learned when you were 5 years old, you’ll remember that the equation for a straight line is y = mx + c. However, statisticians generally prefer to write it as follows.
y = b0 + b1x
y represents the dependent variable
x represents the independent variable
b0 and b1 are constants, called parameters (or coefficients), that need to be estimated from the data.
b0 is known as the intercept. This is the point at which the straight line crosses the Y axis, that is, the value of y when x is 0. In our example, this would be the predicted salary of a fresh graduate joining with no experience.
b1 is the slope of the line. In our example, this is the predicted increase in salary for each additional year of experience.
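To make the formula concrete, here is a minimal sketch in Python. The function name predict_salary and the values of b0 and b1 are made up purely for illustration; they are not estimated from any real data.

```python
def predict_salary(years_experience, b0, b1):
    """Return the salary predicted by the line y = b0 + b1 * x."""
    return b0 + b1 * years_experience


# Hypothetical coefficients, chosen only for illustration.
B0 = 30000  # intercept: predicted salary with 0 years of experience
B1 = 5000   # slope: predicted salary increase per additional year

for years in (0, 3, 10):
    print(years, "years ->", predict_salary(years, B0, B1))
```

With zero years of experience the prediction is just the intercept, and each extra year adds one slope's worth of salary.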
Best Fitting Line
The red line from the graphs above is known as the “Best Fitting Line”. This line represents the model for Simple Linear Regression.
The task of developing a Simple Linear Regression model is to come up with a best fitting line that represents a collection of data.
Once we draw this line, the model can easily predict the salary of a new employee based on how much experience they have.
The best fitting line above shows that a new employee with x1 years of experience should get a salary of y1.
How do we draw this best fitting line?
Let’s take a look at one employee. yi represents the actual salary of the employee, and ŷi (read “y-hat”) represents what their salary should be according to the model. In technical terms:
yi is the actual observation
ŷi is the modeled observation
To figure out how good this line is, we take the difference between yi and ŷi for each employee, square it, and add up the squares for all plotted values in the graph. This can be represented by this equation.
sum of squares = Σ (yi - ŷi)^2
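As a rough sketch of that calculation, the Python snippet below computes the sum of squared differences for two candidate lines. The data points and both pairs of coefficients are invented for illustration only.

```python
def sum_of_squared_errors(xs, ys, b0, b1):
    """Add up (actual - modeled)^2 over all points for the line y = b0 + b1 * x."""
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = b0 + b1 * x        # modeled observation
        total += (y - y_hat) ** 2  # squared difference for this point
    return total


# Made-up data: years of experience and salaries of five employees.
years = [1, 2, 3, 4, 5]
salaries = [35000, 41000, 44000, 52000, 54000]

# Two hypothetical candidate lines.
sse_a = sum_of_squared_errors(years, salaries, b0=30000, b1=5000)
sse_b = sum_of_squared_errors(years, salaries, b0=20000, b1=9000)

print("Line A:", sse_a)
print("Line B:", sse_b)
```

The line with the smaller sum sits closer to the points overall, so here line A is the better fit of the two.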
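Conceptually, linear regression considers every possible line, computes this sum of squares for each one, and picks the line with the minimum sum. This is called the ordinary least squares method. In practice, we don't have to try lines one by one, because the slope and intercept of the minimizing line can be calculated directly from the data.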
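For the curious, here is a small sketch of how the least squares line can be computed directly, using the standard textbook formulas for a single independent variable: the slope is the sum of (xi - mean of x)(yi - mean of y) divided by the sum of (xi - mean of x)^2, and the intercept is mean of y minus slope times mean of x. The function name and the data are my own, made up for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) of the ordinary least squares line y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n

    # How x and y vary together, and how much x varies on its own.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)

    b1 = sxy / sxx              # slope
    b0 = y_mean - b1 * x_mean   # intercept
    return b0, b1


# Made-up data: years of experience and salaries of five employees.
years = [1, 2, 3, 4, 5]
salaries = [35000, 41000, 44000, 52000, 54000]

b0, b1 = fit_simple_linear_regression(years, salaries)
print("intercept b0:", b0)
print("slope b1:", b1)
print("predicted salary at 6 years:", b0 + b1 * 6)
```

Note that the formula breaks down if every x value is the same (sxx would be 0), which makes sense: you can't estimate a slope from a single vertical stack of points.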
And that is all you have to know about Simple Linear Regression! Here’s a quick recap of how to predict values with a Simple Linear Regression model:
Get your dataset
Plot the data on a graph
Figure out the best fitting line