
Data OilSt.

  • Sai Manoj

Machine Learning: Simple Linear Regression

What is Simple Linear Regression?

Simple linear regression is a statistical technique for finding whether an association exists between a dependent variable and a single independent variable. It can only establish that a change in the value of the outcome variable (Y) is associated with a change in the value of the feature (X); regression by itself cannot establish a causal relationship between the two variables.

What is Regression?

Regression is one of the most popular supervised learning techniques in predictive analytics. It is supervised because building a regression model requires knowing both the outcome and the feature variables in the dataset.

The following are a few examples of linear regression problems:

  1. A hospital may be interested in finding how the total treatment cost for a patient varies with the patient's body weight.

  2. Restaurants would like to know the relationship between a customer's waiting time after placing an order and revenue.

  3. Banks and other lenders would like to understand the impact of variables such as unemployment rate, marital status, bank account balance, and rainfall on the percentage of non-performing assets (NPA).

  4. Insurance companies would like to understand the association between healthcare costs and ageing.

  5. An organisation may be interested in finding the relationship between the revenue generated from a product and features such as its price, the money spent on promotion, and competitors' prices and promotion expenses.

What is a Linear Function?

Let’s say you go to a movie and have to pay Rs. 20 for a parking ticket. Each movie ticket costs Rs. 150, and you buy x tickets. The total cost y is easy to predict from the number of tickets, and vice versa, using the equation y = 20 + 150x, which has the form y = mx + c (a linear function).

A linear function has one independent variable and one dependent variable. In the equation y = mx + c, x is the independent variable and y is the dependent variable.

1. c is the y-intercept, the value of y when x = 0 (Rs. 20 in the movie example).

2. m is the slope, the change in the dependent variable for each unit change in x (Rs. 150 per ticket in the movie example).
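The movie example can be sketched in a few lines of Python. This is only an illustration: the function names total_cost and tickets_for are made up here, while the values 20 and 150 come from the example above.

```python
def total_cost(tickets, parking_fee=20, ticket_price=150):
    """Total cost in rupees, y = c + m * x, with c = parking fee and m = ticket price."""
    return parking_fee + ticket_price * tickets


def tickets_for(cost, parking_fee=20, ticket_price=150):
    """Invert the same line to recover x (number of tickets) from y (total cost)."""
    return (cost - parking_fee) / ticket_price


# Each extra ticket raises the cost by exactly the slope (150).
for x in range(4):
    print(x, "tickets cost Rs.", total_cost(x))
```

Because the relationship is exactly linear, going from cost back to the number of tickets is just a rearrangement of the same equation.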

But in daily life, things are rarely this exact.

Let’s take an example: the waist vs. weight distribution in the table below.

Suppose you are on vacation, you love food, but you are health conscious and think about your weight every day. What is the best way to predict your weight from your waist measurement?

From the scatter plot above, we can see that the observations do not all fall on one line, but they follow a roughly linear pattern, so we can treat the relationship as linear. We will now predict weight from waist using machine learning.

In this section, we will work through the theoretical calculation to find the best-fitting line for weight vs. waist.

Steps in theoretical calculations:

1. How to Find the Regression Equation:

In the table below, the x column shows waist and the y column shows weight.

To build the regression model, we need to find the intercept b0 and the slope b1 of the regression equation ŷ = b0 + b1 * x.

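First, we solve for the regression coefficient, which is the slope (b1):

$$b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

Then we solve for the intercept (b0):

$$b_0 = \bar{y} - b_1 \bar{x}$$

Here \bar{x} and \bar{y} are the mean x value and the mean y value.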
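The two formulas can be written out in plain Python. This is a minimal sketch: the original table is not reproduced here, so the waist and weight values below are hypothetical stand-ins, and fit_line is an illustrative name.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations of x.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx               # slope
    b0 = y_mean - b1 * x_mean    # intercept
    return b0, b1


# Hypothetical data (not the article's table): waist in cm, weight in kg.
waist = [70, 75, 80, 85, 90]
weight = [60, 66, 70, 77, 82]

b0, b1 = fit_line(waist, weight)
print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
```

Note that the slope must be computed first, because the intercept formula depends on it.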

2. How to Use the Regression Equation:

Once you have the regression equation, choose a value for the independent variable (x), perform the computation, and you have an estimated value (y) for the dependent variable.
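As a small illustration of this step, the sketch below plugs x values into a fitted line. The coefficients are hypothetical values chosen for the example, not results from the article's data, and predict is an illustrative name.

```python
def predict(x, b0, b1):
    """Estimate the dependent variable from x using y = b0 + b1 * x."""
    return b0 + b1 * x


# Hypothetical coefficients for a waist (cm) to weight (kg) line.
b0_example, b1_example = -17.0, 1.1

for waist_cm in (72, 80, 88):
    weight_kg = predict(waist_cm, b0_example, b1_example)
    print(f"waist {waist_cm} cm -> estimated weight {weight_kg:.1f} kg")
```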

3. How to Find the Coefficient of Determination:

Whenever you use a regression equation, you should ask how well the equation fits the data. One way to assess fit is to check the coefficient of determination, which can be computed from the following formula.

$$R^2 = \left( \frac{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \, \sigma_y} \right)^2$$

where N is the number of observations used to fit the model, \sum is the summation symbol, x_i is the x value for observation i, \bar{x} is the mean x value, y_i is the y value for observation i, \bar{y} is the mean y value, \sigma_x is the standard deviation of x, and \sigma_y is the standard deviation of y.

A coefficient of determination equal to 0.996 indicates that about 99.6% of the variation in the dependent variable can be explained by its relationship to the independent variable.
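The formula can be checked with a short Python sketch. The data below is hypothetical, so the result will not match the 0.996 reported for the article's own dataset. The standard deviations divide by N (population form), which is what the 1/N factor in the formula assumes.

```python
import math


def r_squared(xs, ys):
    """Coefficient of determination for simple linear regression."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Population standard deviations (divide by N, not N - 1).
    sigma_x = math.sqrt(sum((x - x_mean) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - y_mean) ** 2 for y in ys) / n)
    # Average product of deviations (the 1/N * sum term).
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / n
    r = cov / (sigma_x * sigma_y)  # correlation coefficient
    return r ** 2


# Hypothetical data (not the article's table): waist in cm, weight in kg.
waist = [70, 75, 80, 85, 90]
weight = [60, 66, 70, 77, 82]

print(f"R^2 = {r_squared(waist, weight):.3f}")
```

A value close to 1 means the points lie close to the fitted line; points lying exactly on a line give an R^2 of 1.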

Implementation Using Python:

  1. We will use three libraries: pandas to work with the dataset, scikit-learn (sklearn) to implement the machine learning functions, and Matplotlib to visualize our plots.

  2. Import the dataset, explore it, check its shape, and plot it.

  3. Convert the data into DataFrame variables.

  4. Build the linear regression model (i.e. find the slope and intercept, and evaluate the model).

  5. Predict more values.

  6. Visualize the results: plot the regression line and the predicted values.
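The steps above could be sketched roughly as follows. This is not the article's original notebook: the dataset file is not available here, so a small hypothetical DataFrame stands in for the imported data, and the column names waist and weight and the helper plot_fit are assumptions made for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Step 2: hypothetical stand-in for the imported dataset (waist in cm, weight in kg).
df = pd.DataFrame({
    "waist": [70, 75, 80, 85, 90],
    "weight": [60, 66, 70, 77, 82],
})
print(df.shape)       # number of rows and columns
print(df.describe())  # quick numeric summary

# Step 3: separate the feature (2-D, as scikit-learn expects) and the target.
X = df[["waist"]]
y = df["weight"]

# Step 4: fit the model, then read off slope, intercept, and R^2.
model = LinearRegression()
model.fit(X, y)
slope = model.coef_[0]
intercept = model.intercept_
r2 = model.score(X, y)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r2:.3f}")

# Step 5: predict weights for new waist values.
new_waist = pd.DataFrame({"waist": [78, 92]})
predicted = model.predict(new_waist)
print(predicted)


def plot_fit(data, fitted_model, new_x, new_y):
    """Step 6: observed points, red regression line, black predicted points."""
    import matplotlib.pyplot as plt  # imported here so the fit runs without a display

    plt.scatter(data["waist"], data["weight"], label="observed")
    plt.plot(data["waist"], fitted_model.predict(data[["waist"]]),
             color="red", label="regression line")
    plt.scatter(new_x["waist"], new_y, color="black", label="predicted")
    plt.xlabel("waist")
    plt.ylabel("weight")
    plt.legend()
    plt.show()
```

Calling plot_fit(df, model, new_waist, predicted) draws the chart. With a real dataset, the DataFrame would instead come from something like pd.read_csv with the actual file path.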

As you can see, we have built a simple linear regression model, obtained its coefficient (slope) and intercept, and used it to predict new values. In the plotted graphs, the predicted values appear as black points on the red line, which is the linear regression line.
