Linear Regression and Multiple Linear Regression Modelling Problems and Diagnostics
The simplest technique for a supervised learning regression problem is linear regression. It gives the most easily interpreted output, but it suffers from high bias (simple model, i.e. low model complexity → high bias) and low variance (it does not fit the training data closely, i.e. it underfits → low variance).
The dependent variable in linear regression needs to be continuous. Categorical predictors, however, are converted to dummy variables and then processed as described below:
If a categorical variable is used as a predictor, the software will estimate a mean of Y for each category of that predictor. There will be a set of coefficients for the predictor, and each one measures the difference in the mean of Y between one category of X and the reference category.
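As a rough illustration (the dataset and column names here are made up, not from the article), pandas' get_dummies can create the dummy columns, with one category held back as the reference:

```python
import pandas as pd

# Hypothetical data: 'region' is a categorical predictor, 'insurance_amount' the response.
df = pd.DataFrame({
    "region": ["North", "South", "East", "North", "East"],
    "insurance_amount": [10.2, 8.5, 9.1, 11.0, 8.8],
})

# drop_first=True keeps one category as the reference level, so each dummy
# coefficient measures the difference in mean Y versus that reference category.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(dummies)
```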
In what scenario can linear regression be used?
The answer: supervised learning, for regression problems.
Simple and Multiple linear regression
- Regression is a technique for finding the relationship between a response variable and one or more explanatory variables.
- Simple linear regression: predict Y using only one independent variable. Estimated y = b0 + b1*x1
- Multiple linear regression: predict Y by considering more than one independent variable. Estimated y = b0 + b1*x1 + b2*x2
Now comes the question of how the best fit line is selected: the answer is by minimizing the residual sum of squares (RSS), which is what ordinary least squares does.
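Here is a minimal sketch, with made-up numbers, of how ordinary least squares picks the line that minimizes the residual sum of squares:

```python
import numpy as np

# Made-up data points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form OLS estimates for a single predictor: these minimize the RSS.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
rss = np.sum(residuals ** 2)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, RSS = {rss:.3f}")
```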
There can be many candidate lines, so how do we measure the strength of the best fit line?
- R-squared is the best metric for measuring the strength of the best fit line
- R-squared is a statistical measure of how close the data are to the fitted regression line
- It always takes a value between 0 and 1; a value of 1 indicates that the variance in the dependent variable is completely explained by the independent variable
- Mathematically, R-squared = 1 - RSS/TSS, where RSS = residual sum of squares and TSS = total sum of squares
Coefficient of determination
- R-squared = 1 - RSS/TSS, where RSS = residual sum of squares and TSS = total sum of squares
R-squared: best fit line example
- Dependent variable: Insurance amount
- Independent variable: Marketing Budget
- RSS = 37.12
- TSS = 292.42
- R-squared = 0.87
- An R-squared of 0.87 indicates that 87% of the variation in 'insurance amount' is explained by the independent variable, 'Marketing Budget' (verified in the snippet below)
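The R-squared arithmetic from this example can be checked in a couple of lines:

```python
rss = 37.12   # residual sum of squares
tss = 292.42  # total sum of squares

r_squared = 1 - rss / tss
print(round(r_squared, 2))  # 0.87, i.e. 87% of the variation is explained
```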
Interpreting the SLR equation:
- Insurance_amount = 0.239*Marketing_budget - 1.379
- β0 = -1.379, β1 = 0.239
- The simple linear regression equation tells us that the predicted insurance amount increases by 0.239 for every one-unit increase in the marketing budget (a fitting sketch follows).
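A minimal sketch of how such an equation would be obtained in practice, using statsmodels OLS; the data below is synthetic and the column names are only illustrative:

```python
import pandas as pd
import statsmodels.api as sm

# Synthetic, illustrative data; real figures would come from the insurance dataset.
df = pd.DataFrame({
    "Marketing_budget": [10, 20, 30, 40, 50, 60],
    "Insurance_amount": [1.0, 3.5, 5.8, 8.2, 10.6, 13.1],
})

X = sm.add_constant(df["Marketing_budget"])   # adds the intercept term (beta0)
model = sm.OLS(df["Insurance_amount"], X).fit()

print(model.params)     # 'const' is beta0, 'Marketing_budget' is beta1
print(model.rsquared)   # R-squared of the fit
```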
If more terms are added to the model, we move to multiple linear regression. We again find the best fit line using R-squared, but another metric is introduced: adjusted R-squared.
Adjusted R-square
- The adjusted R-squared statistic penalizes the analyst for adding terms to the model.
- It can help guard against overfitting (including regressors that are not really useful).
- Particularly for small N, and where results are to be generalized, take more note of adjusted R-squared.
- Adjusted R-squared is used for estimating explained variance in a population.
- Adjusted R-squared is a better metric than R-squared for assessing how well the model fits the data (a small helper follows this list).
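The usual formula is adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A small helper makes the penalty visible:

```python
def adjusted_r_squared(r_squared: float, n: int, p: int) -> float:
    """n = number of observations, p = number of predictors (excluding the intercept)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# The same R-squared looks worse once more predictors are added.
print(adjusted_r_squared(0.87, n=50, p=1))    # ~0.867
print(adjusted_r_squared(0.87, n=50, p=10))   # ~0.837
```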
Next, the MLR framework is introduced, which shows the flow of modelling.
Variable selection is a stepwise process (a sketch of one such procedure follows this list):
- Add variables one at a time and check the p-values
- Check the p-value of each variable in the model
- Eliminate any variable whose p-value is larger than the chosen threshold
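A minimal sketch of this backward-elimination loop; the 0.05 threshold and the use of statsmodels are illustrative choices, not prescribed by the article:

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X: pd.DataFrame, y: pd.Series, threshold: float = 0.05) -> pd.DataFrame:
    """Repeatedly drop the predictor with the largest p-value above the threshold."""
    X = sm.add_constant(X)
    while True:
        model = sm.OLS(y, X).fit()
        pvalues = model.pvalues.drop("const")        # keep the intercept in the model
        if pvalues.empty or pvalues.max() <= threshold:
            return X                                 # all remaining predictors are significant
        X = X.drop(columns=[pvalues.idxmax()])       # eliminate the least significant variable
```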
Then comes the problem of Multicollinearity.
First, we should define what multicollinearity is:
- It is a state of very high inter-correlations or inter-associations among the independent variables (i.e. X1, X2, X3, etc.).
- Let's say there are three variables in the dataset, i.e. X1, X2 and X3, and there is strong correlation between these variables.
- What will be the effect of this? How do we handle this situation?
- Why is multicollinearity a problem?
- Multicollinearity has no impact on the predictive power of the model as a whole, but it affects the calculations for the individual predictors.
- Parameter estimates may change erratically for small changes in the model or data, making the estimates highly unstable.
- This instability increases the variance of the estimates: a small change in X (the independent variables) produces large changes in the estimates (the expected output).
Using Variance Inflation Factor:
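The VIF measures how much the variance of a coefficient is inflated by correlation with the other predictors. A minimal sketch using statsmodels; the common rule of thumb of flagging VIF above roughly 5-10 is a convention, not something fixed by the article:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """Compute the VIF of every predictor in X (a constant is added for the intercept)."""
    X_const = sm.add_constant(X)
    return pd.DataFrame({
        "variable": X_const.columns,
        "VIF": [variance_inflation_factor(X_const.values, i)
                for i in range(X_const.shape[1])],
    })
```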
As discussed earlier regarding p-value significance:
- Remove the variables with higher p-values, in order of their insignificance.
- If the final model still has a large number of variables, the criterion for keeping variables can be made tougher, i.e. a stricter p-value cutoff.
Four Assumptions in Regression:
Assumption 1:
- The regression model is linear in its parameters.
- For example, the equation Y = β0 + β1*X1² + β2*X2 shows a linear relationship in the parameters (see the sketch below).
- Linearity implies that the coefficients of the variables enter linearly, not the variables themselves.
- In a non-linear equation, the coefficients themselves do not enter linearly.
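To make the point concrete, here is a small synthetic example showing that a model with a squared term is still fitted as an ordinary linear regression, because the squared column is just another input:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"X1": rng.normal(size=100), "X2": rng.normal(size=100)})
# True relationship: Y = 2 + 3*X1^2 - 1.5*X2 + noise.
df["Y"] = 2 + 3 * df["X1"] ** 2 - 1.5 * df["X2"] + rng.normal(scale=0.1, size=100)

# The squared term is just another column, so ordinary OLS still applies.
features = pd.DataFrame({"X1_squared": df["X1"] ** 2, "X2": df["X2"]})
model = sm.OLS(df["Y"], sm.add_constant(features)).fit()
print(model.params)   # coefficients should land close to 2, 3 and -1.5
```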
Assumption 2:
Normality of residuals, i.e. the residuals should follow a bell-curve distribution with zero mean
- Detect this using a plot of the residuals (a sketch follows)
- Possible cause: a missing X variable
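A minimal sketch of checking this assumption with a Q-Q plot of the residuals; the data is synthetic, only so there are residuals to look at:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Synthetic data and fit, just to produce residuals to inspect.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Residuals should look roughly bell-shaped with mean zero;
# points close to the 45-degree line in the Q-Q plot suggest normality.
sm.qqplot(model.resid, line="45", fit=True)
plt.show()
```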
Assumption 3:
Homoscedasticity of residuals or equal variance
- The opposite of homoscedasticity is heteroscedasticity: the error variance changes in a systematic pattern with changes in the X value.
- Meaning that the error variance itself changes with the value of X.
- An appropriate transformation can eliminate the heteroscedasticity problem.
Detection of Heteroscedasticity
- The best way to detect heteroscedasticity is to visualize the relationship between the squared residuals and the independent variable and check whether the plot exhibits any pattern.
- For multiple linear regression, plot the squared residuals against each of the independent variables.
- If the number of independent variables is large, plot the squared residuals against the predicted output variable (a sketch follows).
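A minimal sketch of this diagnostic on deliberately heteroscedastic synthetic data: plot the squared residuals against the predicted values and look for a funnel-shaped pattern:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=200)
# The error spread grows with x, so this data is deliberately heteroscedastic.
y = 1.0 + 2.0 * x + rng.normal(scale=0.3 * x)

model = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(model.fittedvalues, model.resid ** 2, s=10)
plt.xlabel("Predicted values")
plt.ylabel("Squared residuals")   # a funnel or trend here signals heteroscedasticity
plt.show()
```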
Assumption 4:
- No Multicollinearity among independent variables
- Multicollinear variables should be removed from the final model
- The VIF (variance inflation factor), discussed above, is used to detect multicollinearity and decide which variables to drop
Conclusion:
So, all in all, from understanding the business problem to the final model there are many hurdles, but these hurdles tell us more about the data than any other model does. Black-box models may work on a different level, but the interpretability of SLR and MLR is above all others, which helps the business analyst summarize the output for the business in an effective way. So some pains are good to have; take the best out of them and move forward.