Adarsh Menon Adarsh is a tech & data science enthusiast. In his own words, “I make websites and teach machines to predict stuff. I also make YouTube videos — https://www.youtube.com/adarshmenon”

# Learn how logistic regression works and ways to implement it from scratch as well as using sklearn library in Python


In statistics, logistic regression is used to model the probability of a certain class or event. In this post I will focus on the basics and implementation of the model, and not go too deep into the math. Just to give you a heads up, this article is a written version of the video tutorial that can be found here.

Logistic regression is similar to linear regression in that both involve estimating the parameters of a prediction equation from the given training data. However, linear regression predicts the value of a continuous dependent variable, whereas logistic regression predicts the probability of an event or class that depends on other factors. The output of logistic regression therefore always lies between 0 and 1, and because of this property it is commonly used for classification purposes.


# Logistic Model

Consider a model with features x1, x2, x3, …, xn. Let the binary output be denoted by Y, which can take the values 0 or 1.
Let p be the probability of Y = 1; we can denote it as p = P(Y=1).
The mathematical relationship between these variables can be written as:

ln(p/(1−p)) = b0 + b1·x1 + b2·x2 + … + bn·xn

Here the term p/(1−p) is known as the odds and denotes the likelihood of the event taking place. Thus ln(p/(1−p)) is known as the log odds and is simply used to map the probability, which lies between 0 and 1, to the range (−∞, +∞). The terms b0, b1, b2, … are parameters (or weights) that we will estimate during training.
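As a quick worked example (the probability value here is chosen by me purely for illustration), this is how a probability maps to odds and log odds:

```python
import math

p = 0.8                    # probability of the event occurring
odds = p / (1 - p)         # 4.0: the event is 4 times as likely to occur as not
log_odds = math.log(odds)  # about 1.386, an unbounded real number
```

Note how a probability above 0.5 gives positive log odds and one below 0.5 gives negative log odds, which is exactly the (−∞, +∞) mapping described above.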

So this is just the basic math behind what we are going to do. We are interested in the probability p in this equation. So we simplify the equation to obtain the value of p:

1. The log term ln on the LHS can be removed by raising both sides as powers of e:

p/(1−p) = e^(b0 + b1·x1 + b2·x2 + … + bn·xn)

2. Now we can easily rearrange to obtain the value of p:

p = 1 / (1 + e^−(b0 + b1·x1 + b2·x2 + … + bn·xn))

This actually turns out to be the equation of the Sigmoid Function, which is widely used in other machine learning applications. The Sigmoid Function is given by:

S(z) = 1 / (1 + e^−z)

Now we will be using the derived equation above to make our predictions. Before that, we will train our model to obtain the values of our parameters b0, b1, b2… that results in the least error. This is where the error or loss function comes in.
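As a minimal sketch, the sigmoid can be written in two lines of Python (assuming NumPy, which the rest of this tutorial also relies on):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real input into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))
```

Because NumPy broadcasts, the same function works on a single number or on a whole array of inputs at once.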

# Logistic regression loss function

The loss is basically the error in our predicted value; in other words, it is the difference between our predicted value and the actual value. We will be using the L2 Loss Function to calculate the error, though in theory you could use any differentiable function here. This function can be broken down into three steps:

1. Let the actual value be yᵢ and the value predicted using our model be ȳᵢ. Find the difference between the actual and predicted value.
2. Square this difference.
3. Find the sum of these squared differences across all the values in the training data. This gives the loss: L = Σ (yᵢ − ȳᵢ)².
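The three steps above can be sketched as a small helper function (the name l2_loss is my own):

```python
import numpy as np

def l2_loss(y_true, y_pred):
    # Step 1: difference, step 2: square, step 3: sum over the training data
    return np.sum((y_true - y_pred) ** 2)
```

The loss is 0 only when every prediction matches its label exactly, and it grows as predictions drift away from the actual values.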

Now that we have the error, we need to update the values of our parameters to minimize it. This is where the “learning” actually happens: our model updates itself based on its previous output to obtain a more accurate output in the next step, so with each iteration it becomes more and more accurate. We will be using the Gradient Descent Algorithm to estimate our parameters; another commonly used approach is Maximum Likelihood Estimation. If you plot the loss or error on the y-axis against the number of iterations on the x-axis, you can watch the error fall with each epoch.

You might know that the partial derivatives of a function are equal to 0 at its minimum. Gradient descent uses this idea to estimate the parameters or weights of our model by minimizing the loss function. Check out the below video for a more detailed explanation of how gradient descent works.

For simplicity, for the rest of this tutorial let us assume that our output depends only on a single feature x. So we can rewrite our equation as:

p = 1 / (1 + e^−(b0 + b1·x))

Thus we need to estimate the values of weights b0 and b1 using our given training data.

1. Initially let b0=0 and b1=0. Let L be the learning rate. The learning rate controls by how much the values of b0 and b1 are updated at each step in the learning process. Here let L=0.001.
2. Calculate the partial derivative of the loss with respect to b0 and b1. For the L2 loss with the sigmoid prediction, these work out to:

D_b0 = −2 · Σ (yᵢ − ȳᵢ) · ȳᵢ · (1 − ȳᵢ)
D_b1 = −2 · Σ (yᵢ − ȳᵢ) · ȳᵢ · (1 − ȳᵢ) · xᵢ

The value of the partial derivative will tell us how far the loss function is from its minimum value. It is a measure of how much our weights need to be updated to attain the minimum, or ideally 0, error. If you have more than one feature, you need to calculate the partial derivative for each weight b0, b1, …, bn, where n is the number of features. For a detailed explanation of the math behind calculating the partial derivatives, check out my video.

3. Next we update the values of b0 and b1:

b0 = b0 − L · D_b0
b1 = b1 − L · D_b1

4. We repeat this process until our loss function is a very small value or ideally reaches 0 (meaning no errors and 100% accuracy). The number of times we repeat this learning process is known as iterations or epochs.
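The four steps above can be sketched as a short training function. This is an illustrative sketch under the single-feature L2-loss setup derived above; the function and variable names (train, lr) are my own, and the full code lives in the linked Colab notebook:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, lr=0.001, epochs=300):
    b0, b1 = 0.0, 0.0  # step 1: start both weights at zero
    for _ in range(epochs):
        y_pred = sigmoid(b0 + b1 * X)
        # step 2: partial derivatives of the L2 loss w.r.t. b0 and b1
        D_b0 = -2 * np.sum((y - y_pred) * y_pred * (1 - y_pred))
        D_b1 = -2 * np.sum((y - y_pred) * y_pred * (1 - y_pred) * X)
        # step 3: move each weight a small step against its gradient
        b0 = b0 - lr * D_b0
        b1 = b1 - lr * D_b1
    # step 4: repeating for `epochs` iterations stands in for "until the loss is small"
    return b0, b1
```

The learning rate and epoch count here mirror the L = 0.001 and 300 iterations used in this tutorial; in practice you would tune both.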

# Implementing the Model

Import the necessary libraries and download the data set here. The data was taken from Kaggle and describes whether a product was purchased through an advertisement on social media. We will be predicting the value of Purchased using a single feature, Age, although you can use multiple features as well.

We need to normalize our training data and shift the mean to the origin. This is important to get accurate results because of the nature of the logistic equation. This is done by the normalize method. The predict method simply plugs in the value of the weights into the logistic model equation and returns the result. This returned value is the required probability.
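As a sketch of what such normalize and predict helpers might look like (illustrative only; the exact code is in the linked Colab notebook):

```python
import numpy as np

def normalize(X):
    # Shift the mean of the feature to the origin and scale by its spread
    return (X - X.mean()) / X.std()

def predict(X, b0, b1):
    # Plug the weights into the logistic model to get the probability
    return 1 / (1 + np.exp(-(b0 + b1 * X)))
```

Normalizing keeps the inputs to the exponential in a moderate range, which makes gradient descent on the logistic equation much better behaved.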

The model is trained for 300 epochs or iterations. The partial derivatives are calculated at each iteration and the weights are updated. You can even calculate the loss at each step and see how it approaches zero with each step.

Since the prediction equation returns a probability, we need to convert it into a binary value to be able to make classifications. To do this, we select a threshold, say 0.5: all predicted values above 0.5 will be treated as 1 and everything else as 0. You can choose a suitable threshold depending on the problem you are solving.
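The thresholding step can be written in one line (the function name classify is my own):

```python
import numpy as np

def classify(probabilities, threshold=0.5):
    # Probabilities above the threshold map to class 1, the rest to class 0
    return (probabilities > threshold).astype(int)
```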

Here for each value of age in the testing data, we predict if the product was purchased or not and plot the graph. The accuracy can be calculated by checking how many correct predictions we made and dividing it by the total number of test cases. Our accuracy seems to be 85%.
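Accuracy, as described, is correct predictions divided by total test cases (helper name mine):

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of predictions that match the actual labels
    return np.mean(y_true == y_pred)
```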


# Implementing using Sklearn

The library sklearn can be used to perform logistic regression in a few lines using the LogisticRegression class, and it also supports multiple features. It requires the input values to be in a specific format, hence they have been reshaped before training with the fit method.
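A sketch of the sklearn version follows, using hypothetical toy values in place of the actual Age and Purchased columns from the data set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for the Age feature and Purchased label
ages = np.array([18, 22, 27, 45, 51, 60])
purchased = np.array([0, 0, 0, 1, 1, 1])

# fit expects a 2-D array of shape (n_samples, n_features), hence the reshape
X_train = ages.reshape(-1, 1)

clf = LogisticRegression()
clf.fit(X_train, purchased)
predictions = clf.predict(np.array([[20], [55]]))
```

With more than one feature, you would simply pass a wider 2-D array to fit; no other changes are needed.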

The accuracy using this is 86.25%, which is very close to the accuracy of our model that we implemented from scratch!

Thus we have implemented a seemingly complicated algorithm easily in Python from scratch and also compared it with a standard model in sklearn that does the same. I think the most crucial part here is the gradient descent algorithm, and learning how the weights are updated at each step. Once you have learned this basic concept, you will be able to estimate parameters for any function.

Click Here for the entire code and explanation in a Google Colaboratory. You can use it to explore and play around with the code easily.

# Video Tutorial on Logistic Regression

Note: This article was originally published on towardsdatascience.com, and kindly contributed to DPhi to spread the knowledge.



## 4 Replies to “Tutorial on Logistic Regression using Gradient Descent with Python”

1. Ravindra says:

Thank you for such an elegant code. What changes does one have to make if input X has more than one column?

2. Manish says:

Hi Ravindra,

If you are planning to build from scratch, the number of coefficients will increase. For example, in the example shown above there is one column in X, so there are two constants: b1 as the coefficient and b0 as the intercept. Let’s say you have two columns in X; then there will be three constants: two coefficients, b1 and b2, and one intercept, b0, and so on.

If you are building the model using sklearn, you don’t need to do any changes.

3. Kartik says:

That’s just using libraries lmao

4. Chris Tralie says:

This is an awesome tutorial, thank you! It really helped me to understand this better. One thing I’m wondering, though, is why you chose squared loss. I did some more reading and realized that the squared loss is not convex, so you’re not guaranteed to have a global minimum. Instead, if you use the loss function

−y·log(logistic(x)) − (1−y)·log(1−logistic(x))

then this is convex. See here:
http://mathgotchas.blogspot.com/2011/10/why-is-error-function-minimized-in.html
It also leads to a super slick and simple update rule. For b1, for example, it will be

y-y_pred 