Logistic Regression

Can you predict success in exams?
Have you ever helped your younger sibling prepare for a big exam
and wondered why some students seem to breeze through exams effortlessly
while others struggle despite putting in long hours? And whether there is a way to predict
whether a student will pass or fail based on their study hours?
Let's delve into this
problem. Understanding how different factors influence exam outcomes leads us to a
vital statistical tool - Logistic Regression!
Understanding Logistic Regression
Logistic regression is a method for predicting categorical outcomes - for example, categorizing whether a student passes or fails an exam based on their study hours.
But why not use Linear Regression in this case?
Because linear regression outputs values across the entire range (-∞, ∞), but in our case we're dealing with 'pass' or 'fail', like flipping a light switch on or off. Hence, we need a function that yields one of two outcomes - 'pass' or 'fail', 'yes' or 'no', 0 or 1 - for any given input.

How does logistic regression work?
Let’s start by understanding the sigmoid function.
The sigmoid function, also known as the logistic function, is an S-shaped curve smoothly
transitioning between 0 and 1 as the input changes. In the pass-fail example, the
sigmoid curve takes the study hours as input and maps them to the probability of passing
the exam. The curve starts low, then gradually rises, and eventually levels off at
nearly 1.
Imagine a black box that takes different study hours as inputs and
outputs the probability of passing the exam, which is then mapped to either a 0
(fail) or a 1 (pass).
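
To make this concrete, here is a minimal Python sketch of the sigmoid. The coefficients -4.0 and 1.5 below are made up purely for illustration, not learned from any real data.

    import math

    def sigmoid(z):
        # Squashes any real number into the range (0, 1)
        return 1 / (1 + math.exp(-z))

    # Hypothetical model: z = -4.0 + 1.5 * hours_studied
    for hours in [0, 2, 4, 6, 8]:
        z = -4.0 + 1.5 * hours
        print(hours, round(sigmoid(z), 3))

Running this prints probabilities that start near 0, rise through the middle, and level off near 1 - the S-shape described above.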

Given a continuous probability between 0 and 1, how do we determine whether the student passes or fails?
Choosing the right threshold
Each point on the sigmoid curve represents the probability of
passing the exam. Now imagine the point on the sigmoid curve where it
transitions from predicting 'fail' to predicting 'pass.' This is the threshold value. In
the context of our pass-fail example, it marks the number of study hours at which we
would say, 'Okay, if the probability of passing is above this point, we predict 'pass,'
and if it's below, we predict 'fail.''
In the graph below, we have assumed the threshold to be 0.5. If the student studies for
6 hours, the probability of passing the exam exceeds 0.5, and hence the student would
be predicted as 'Pass'.
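
Here is a small sketch of how such a threshold could be applied in code, reusing the same made-up coefficients as in the earlier sketch.

    import math

    THRESHOLD = 0.5  # assumed cut-off, matching the 0.5 used in the graph

    def predict(hours):
        # Probability of passing from the same hypothetical model as before
        p = 1 / (1 + math.exp(-(-4.0 + 1.5 * hours)))
        return "Pass" if p >= THRESHOLD else "Fail"

    print(predict(6))  # probability around 0.99, predicted 'Pass'
    print(predict(2))  # probability around 0.27, predicted 'Fail'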

Choosing the threshold involves finding a balance. Opting for a
higher threshold means we predict fewer 'pass' cases, but those predictions are more
likely to be correct. Conversely, opting for a lower threshold yields more 'pass'
predictions, yet it may lead to false positives – instances where we predict a 'pass'
but the student actually failed.
Do you now wonder, just like we do, how the study hours are converted to a probability?
Converting study hours to probability
To understand this, let's bring together everything we have learned so far
and expand on it.
Imagine we have a dataset filled with information about how many hours students studied
and whether they passed or failed. This is known as our training data. The objective of
logistic regression is to build a formula that, when given the number of hours a student
studied, can tell whether they passed or failed.
When we input the hours studied into this formula, it gives us a probability score:
a value close to 0 indicates that we are confident the student will fail, and a value
close to 1 indicates that the student will pass. But how do we know if our formula
is good?
That's where the concept of error comes in.
Error is the difference between how sure our formula is and what actually happened. If
our formula says a student will pass with high probability, but they actually failed, the
error is high. On the other hand, if our formula hesitated and said they might fail, and
they did fail, the error would be lower.
When we add up these errors for all the students in our training data, we get the total
error. If you are now thinking that we need to minimize this total error, then you are on the
right path.
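
One standard way to measure this error in logistic regression is the log loss (also called cross-entropy). The probabilities below are made up purely to illustrate the idea.

    import math

    def log_loss(actual, predicted_prob):
        # actual is 1 for pass, 0 for fail
        # The penalty grows sharply when a confident prediction turns out wrong
        return -(actual * math.log(predicted_prob)
                 + (1 - actual) * math.log(1 - predicted_prob))

    print(round(log_loss(0, 0.9), 2))  # confidently predicted 'pass', student failed -> large error (about 2.3)
    print(round(log_loss(0, 0.4), 2))  # hesitant prediction, student failed -> smaller error (about 0.51)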
How do we minimize the errors though?
We use familiar optimization techniques like gradient descent, a method that iteratively adjusts the parameter values in the direction that reduces the overall error, until we find the values that yield the lowest error.
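
Below is a minimal gradient descent sketch for our one-feature problem. The training data, learning rate, and number of iterations are all made up for illustration.

    import math

    # Hypothetical training data: (hours studied, 1 = pass / 0 = fail)
    data = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1)]

    b0, b1 = 0.0, 0.0          # start with arbitrary parameters
    learning_rate = 0.1

    for _ in range(5000):       # repeat many small adjustments
        for hours, passed in data:
            p = 1 / (1 + math.exp(-(b0 + b1 * hours)))  # current predicted probability
            error = p - passed                          # gradient of the log loss
            b0 -= learning_rate * error                 # nudge parameters to reduce error
            b1 -= learning_rate * error * hours

    print(b0, b1)  # learned parameters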

The equation for the line remains unchanged from linear regression, but it is now wrapped inside a logistic (sigmoid) function. This logistic function compresses the output, ensuring it stays within the range of 0 to 1.
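In symbols, writing the straight line as b0 + b1 × hours_studied (the same two parameters learned in the sketch above), the prediction becomes:

    probability of passing = 1 / (1 + exp(-(b0 + b1 * hours_studied)))

The inner expression is the familiar linear regression line; passing it through the logistic function squashes the result into the 0-to-1 range.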
What if we have more than 2 categories?
Let's say we want to predict whether a student will score below
50, between 50 to 80, or above 80 marks. You might ask if logistic regression can handle
such cases.
It can.
While our detailed discussion focused on binary classification with just two classes, we
can expand it to multiple classes, which is known as multiclass classification. So,
whether you're classifying different flower species or diagnosing illnesses, logistic
regression can be a dependable tool for all such tasks.
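
As a rough sketch of what this could look like in practice, scikit-learn's LogisticRegression handles the multiclass case out of the box; the tiny dataset below is invented purely for illustration.

    from sklearn.linear_model import LogisticRegression

    # Hours studied vs. score band: 0 = below 50, 1 = 50 to 80, 2 = above 80 (made-up data)
    X = [[1], [2], [3], [4], [5], [6], [7], [8]]
    y = [0, 0, 0, 1, 1, 1, 2, 2]

    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)

    print(model.predict([[5]]))        # most likely score band for 5 hours of study
    print(model.predict_proba([[5]]))  # probability of each of the three bands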
Real-World Business Use-Case
Empowering Healthcare with Data-Driven Patient Insights
Imagine a medical research institute striving to improve patient
outcomes by identifying the likelihood of a certain medical condition based on various
patient characteristics. The institute has gathered vast datasets containing patient
profiles, medical histories, genetic markers, and diagnostic test results.
In this context, logistic regression serves as a powerful classification tool. By
examining the relationships between patient attributes such as age, family history,
genetic markers, and test results, logistic regression can predict the probability of a
patient having a specific medical condition.
For instance, it might unveil that patients with a certain genetic marker and a family
history of the condition are more likely to develop the medical condition. Conversely,
patients with certain test results falling within a certain range might exhibit a lower
probability of having the condition.
Armed with these insights, the medical research institute can enhance its diagnostic
processes, develop personalized treatment plans, and allocate resources more
efficiently. In a field where early detection and targeted interventions are crucial,
leveraging logistic regression's predictive capabilities can lead to improved patient
care and advancements in medical research.
Have any burning questions or specific topics you'd like us to explore? Share your thoughts at hi@simplifyingstuff.com