Logistic Regression — Classification or Regression?

5 min readSep 20, 2021

In one of my recent interviews I was asked one of a routine question which a fresher applying for a data science related job role gets asked — “What kind of algorithm is logistic regression?” Answered —Classification. My own answer triggered a question in my mind or maybe my alleged broken sense of humour did, “if it’s a classification algo why does it has regression in its name?”

Regression vs Classification

Regression and Classification algorithms both belong to supervised learning domain. To understand in beginner level terms -

Regression: Algorithms used to predict continuous values using dataset. Example- number of car bookings for the upcoming quarter, effects of fertiliser and water on crop yield, detection of trends in future options and movement of a stock and even a medical operation efficiency. Meaning it’s a next-stop after correlation. We use feature(s) to predict continuous values.

Classification: As the name calls, it is used to classify. Whether an e-mail is spam or not, a tumour is benign or malignant, a credit card transaction is fraudulent or not.

In this article we’re gonna focus on and understand Logistic regression by implementing a LogisticRegression() on a credit card fraud detection dataset acquired from Kaggle.

The needful mathematics behind it

Linear regression, as previously stated, predicts a continuous value based on the relationship between dependent and independent variables. If Y (dependent variable) is a function of X (independent variable) then Y is represented as:

Using this mathematical expression a “best fit line” is calculated and we predict a real-valued output y based on a weighted sum of input variables.

Now using the same data, what if we were asked to detect which value belongs to which class? Firstly, we’d have to create and define classes amongst the values. Our output, in that case, would be binary. Either Yes or No. Either 0 or 1. Instead of predicting a numerical dependent variable we’d implement prediction/classification of a categorical dependent variable.

We need an algorithm which using a threshold bifurcate the data into different classes. So how does Logistic Regression work? It’s a classification problem but formalized into a regression problem by trying to “regress” the conditional Probability of any of the classes Y=1. (modelling the conditional probability: P(Y=1|X=x)P(Y=1|X=x) knowing that,
P(Y=0|X=x)=1−P(Y=1|X=x)P(Y=0|X=x)=1−P(Y=1|X=x)) .

In the above figure, let’s consider threshold as 0.5. Hence —

Now, let’s build a classifier to detect if a credit card transaction is fraudulent or not.

Importing the data

The description of data is available on Kaggle. Let’s move ahead —

Visualisation

The Class feature is a binary feature, 0 for non-fraudulent and 1 representing fraudulent. As it can be seen, the data is heavily skewed as the majority of the cases are non-fraudulent. The Class feature is the Y (dependent variable) in this scenario. This will lead inaccurate results from the classifier as the training data isn’t normally distributed. Let’s level the playing field.

A machine learning model enters the bar:
Bartender: “What would you like to drink Sir?”
ML model: “Whatever the majority is having.”
-someone on tech twitter space.

Sub sampling and observations

To level the playing field, we create a sub-sample with equal number of fraudulent and non-fraudulent cases. In cases like fraudulency it’s a good habit to check if they’re any correlations. Let’s see what does Amount has to show us.

It’s different than the usual scatter plots that we see. But it does tell us something! Most fraudulent transactions are between the amount 0–2000.

Data splitting and Modelling

When implementing classifiers, we should also check for false positives. To understand that, we have to refer to confusion matrix.

Well it looks like our model classified 8 transactions as non-fraudulent even though they were fraudulent. And our accuracy score is 93.4%.

TL;DR

Logistic regression is emphatically not a classification algorithm on its own. It is only a classification algorithm in combination with a decision rule that makes dichotomous the predicted probabilities of the outcome. Logistic regression is a regression model because it estimates the probability of class membership as a multilinear function of the features.

The jupyter notebook of this project can be found here.

References: