Logistic Regression: an optimal balance between accuracy and explanatory power

Jai Kushwaha
4 min read · Sep 26, 2020

Logistic regression is a generalized form of linear regression. It helps in predicting and classifying data into binary outcomes such as win or loss, true or false, hit or flop, or simply 0 or 1. It is a machine learning technique for binary classification.

The sigmoid (logistic) function is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits. (The logit function is its inverse: it maps probabilities back onto the real line.)

sigmoid(z) = 1 / (1 + e^-z)

where z = b0 + b1*x1 + b2*x2 + … (a linear equation in which b1, b2, … are coefficients and b0 is the intercept)

(Figure: the sigmoid function and a variant of the sigmoid function)
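As a minimal sketch of the formulas above (plain Python, standard library only; the function names are my own, not from any particular package), the sigmoid squashes the linear combination z into a probability:

```python
import math

def sigmoid(z):
    """Map any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, coefficients, intercept):
    """Compute z = b0 + b1*x1 + b2*x2 + ... and squash it with the sigmoid."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return sigmoid(z)

# Large negative z maps near 0, large positive z near 1, 0 maps to exactly 0.5
print(sigmoid(-6))  # ~0.0025
print(sigmoid(0))   # 0.5
print(sigmoid(6))   # ~0.9975
```

Note that the output approaches 0 and 1 asymptotically but never reaches them, which is what lets it be interpreted as a probability.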

Points to ponder regarding Logistic regression:

  • Accuracy: since the output is binary, performance can be assessed with the AUC-ROC and the confusion matrix, and the model is comparatively resistant to overfitting.
  • Easy to apply: homogeneity of variance is not required, errors need not be normally distributed, large datasets can be used, and splitting the data into training and test sets makes the evaluation more reliable.
  • Explanatory power: the coefficient size measures how relevant a predictor is, and its sign gives the direction of association, positive or negative.

The choice of threshold value is largely driven by the precision and recall it yields. Ideally, we want both precision and recall to be 1, but this is seldom the case. In a precision-recall trade-off, we use the following arguments to decide upon the threshold:

  1. Low Precision/High Recall: in applications where we want to reduce the number of false negatives without necessarily reducing the number of false positives, we choose a decision threshold with a low value of precision or a high value of recall. For example, in a cancer diagnosis application, we do not want any affected patient to be classified as unaffected, even at the cost of some patients being wrongly flagged. The absence of cancer can be confirmed by further medical tests, but the presence of the disease cannot be detected in an already rejected candidate.
  2. High Precision/Low Recall: in applications where we want to reduce the number of false positives without necessarily reducing the number of false negatives, we choose a decision threshold with a high value of precision or a low value of recall. For example, if we are classifying whether customers will react positively or negatively to a personalized advertisement, we want to be absolutely sure that the customer will react positively, because otherwise a negative reaction can cause a loss of potential sales.
We can evaluate logistic regression model fit and accuracy in many ways; a few are listed here:

  • AIC (Akaike Information Criterion): an important indicator of model fit. It follows the rule: the smaller, the better.
  • Confusion matrix: yields accuracy, true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), false negative rate (FNR), and precision.
  • Receiver Operating Characteristic (ROC): assesses the accuracy of a model at a user-defined threshold value, while the area under the curve (AUC) summarizes performance across all thresholds.
  • Null deviance and residual deviance: null deviance indicates the fit of a model with nothing but an intercept; residual deviance indicates the fit after adding independent variables. In both cases, the lower the value, the better the model.
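The precision-recall trade-off described above can be sketched in plain Python on a handful of hypothetical labels and predicted probabilities (the numbers are illustrative, not from any real model): lowering the threshold raises recall at the expense of precision, and raising it does the opposite.

```python
def confusion_counts(y_true, probs, threshold):
    """Classify each probability against a threshold and tally the confusion matrix."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def precision_recall(tp, fp, tn, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical ground-truth labels and model scores
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
probs  = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]

for threshold in (0.3, 0.5, 0.85):
    p, r = precision_recall(*confusion_counts(y_true, probs, threshold))
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, threshold 0.3 gives high recall but low precision, while threshold 0.85 gives perfect precision but low recall, mirroring cases 1 and 2 above.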

Comparing Logistic regression with Linear regression and Decision Tree

Accuracy and selection of variables

The accuracy of the model depends on the choice of independent variables, which has to be balanced against keeping only the relevant ones.

  • We can successively build a more robust model using only the key independent variables that best explain variation in the dependent variable.
  • The extent of variation explained by each independent variable is assessed with respect to its log-odds ratio or probability estimates.
  • While selecting significant variables to build a robust model, we assess the accuracy of each candidate model.
  • The final model is chosen carefully, aiming to strike a good balance between accuracy and significant variables that best explain variation in the dependent variable.
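To illustrate how the log-odds assessment above is read in practice (the variable names and coefficient values here are hypothetical, not from any fitted model): each coefficient is a change in log odds per unit of the predictor, so exponentiating it gives an odds ratio.

```python
import math

# Hypothetical fitted coefficients on the log-odds scale
coefficients = {"income": 0.8, "age": -0.05, "prior_default": 1.5}

for name, b in coefficients.items():
    # exp(coefficient) is the multiplicative change in odds per one-unit increase
    odds_ratio = math.exp(b)
    direction = "increases" if b > 0 else "decreases"
    print(f"{name}: coefficient={b:+.2f}, odds ratio={odds_ratio:.2f} "
          f"(a one-unit rise {direction} the odds of the positive class)")
```

An odds ratio above 1 marks a positive association and one below 1 a negative association, which is exactly the coefficient-size-and-direction reading of explanatory power described earlier.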

From an accuracy perspective, machine learning techniques used for classification such as CART, random forests, and ANNs will often provide better results, but they fall short in explanatory power because they are black-box techniques. Hence there are limits to using such techniques where explanation plays a vital role.

Some Examples

The logit model can be explained thanks to the simple concept of outcome prediction using logarithmic mathematics and the sigmoid function, and it presents solutions to business problems such as:

  • predicting loan defaulters
  • predicting win/loss scenarios
  • segregating spam from relevant mail

Conclusion

There are other models, like neural networks and random forests, that are relatively more efficient in terms of prediction. However, these are black-box techniques and lack explainability. For instance, if we are tracking five variables to predict brand loyalty, black-box techniques can never tell us the relative importance and weight of each variable. There are scenarios in which the end result is not very important but the explainability of the model is. In all such cases, logistic regression is a far better technique. When we consider both factors (explanatory power and accuracy), logistic regression is a better model.

Jai Kushwaha

I am a Senior Consultant with 11+ years of experience in analytics and model development, with domain expertise in BFSI.