First Steps to Understand and Improve Your OLS Regression — Part 1
They say linear regression models are the simplest approach to supervised learning. But when you are new to statistics and data science, even the most straightforward model can be overwhelming and challenging to interpret. I started learning about linear regressions a few weeks ago as part of the data science boot camp I’m taking. The concepts seemed easy to follow until our instructor tasked us with building a baseline model and improving it through iterations. You plug your variables into the code to fit your regression and then face a summary table packed with information that should guide your next steps. If you’ve ever felt stuck at that stage, unsure what to make of all those measures, I hope this blog post helps you.
I’ve taken a dataset on house sales in King County, WA (available on Kaggle here) to create a quick OLS model in Python (using the statsmodels module). I’ve performed some data cleaning and reduced the number of columns to simplify our example. Here are the top ten rows from our data frame:
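As a sketch of that setup, here is how the cleaned frame might be built and inspected. The handful of rows below are synthetic stand-ins (the real file would come from the Kaggle download, e.g. via `pd.read_csv`), and the column names `price`, `bedrooms`, `sqft`, and `zipcode` match the features used in this post:

```python
import pandas as pd

# Synthetic stand-in for the cleaned King County frame; with the real data
# you would load the Kaggle CSV instead, e.g. pd.read_csv("kc_house_data.csv")
df = pd.DataFrame({
    "price":    [221900, 538000, 180000, 604000, 510000],
    "bedrooms": [3, 3, 2, 4, 3],
    "sqft":     [1180, 2570, 770, 1960, 1680],
    "zipcode":  [98178, 98125, 98028, 98136, 98074],
})

# Peek at the top rows, as in the screenshot above
print(df.head())
```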
Our model will use three independent variables (features), bedrooms, sqft, and zipcode, to explain our dependent (target) variable, “price.”
Before you fit your model, I advise you to check the linearity and multicollinearity assumptions to make sure that a regression model is appropriate for the data you’re working with. You want to see if your independent variables have a linear relationship with your dependent variable, and also check how the variables are correlated with each other.
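A quick way to run both checks at once is a pairwise correlation matrix: the target’s row shows how linearly each feature tracks the price, while large feature-to-feature correlations warn of multicollinearity. This sketch uses a small synthetic frame; with the real data you would call `.corr()` on your cleaned DataFrame:

```python
import pandas as pd

# Synthetic example frame; substitute your cleaned King County DataFrame here
df = pd.DataFrame({
    "price":    [221900, 538000, 180000, 604000, 510000, 1225000],
    "bedrooms": [3, 3, 2, 4, 3, 4],
    "sqft":     [1180, 2570, 770, 1960, 1680, 5420],
})

# Pairwise Pearson correlations: the "price" row gauges linearity with the
# target; the feature-feature entries flag potential multicollinearity
corr = df.corr()
print(corr.round(2))

# For a visual linearity check, scatter each feature against the target, e.g.:
# df.plot.scatter(x="sqft", y="price")
```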
To recap: a linear regression model with three predictor variables can be expressed with the equation above, where β0 (beta-naught) is the intercept and β1 through β3 are the coefficients for each independent variable. We also have e, the residual error, which captures the variation in the target that the predictors leave unexplained.
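Written out in full, the three-predictor model is:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + e
```

Here y is the price, x1 through x3 are bedrooms, sqft, and zipcode, and e is the residual error term described above.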
Depending on what statistical software or what statistical library you are working with, your regression function’s output might look slightly different. Rest assured, all the elements I’m discussing in this post will be available to you regardless. For convenience, I chose statsmodels because it provides this comprehensive summary table full of useful (and perhaps overwhelming to beginners) information.
Let’s run fit and summarize the OLS model using statsmodels in Python.
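A minimal sketch of that fit, using `statsmodels` and a synthetic stand-in for the cleaned data (with the real Kaggle file, `df` would come from your own loading and cleaning steps):

```python
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in for the cleaned King County frame
df = pd.DataFrame({
    "price":    [221900, 538000, 180000, 604000, 510000, 1225000, 257500, 291850],
    "bedrooms": [3, 3, 2, 4, 3, 4, 3, 3],
    "sqft":     [1180, 2570, 770, 1960, 1680, 5420, 1715, 1060],
    "zipcode":  [98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198],
})

# statsmodels does not add an intercept by default, so add the constant column
X = sm.add_constant(df[["bedrooms", "sqft", "zipcode"]])
y = df["price"]

model = sm.OLS(y, X).fit()
print(model.summary())  # the full summary table discussed below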
Now what? How do we interpret all of this? Do we need to pay attention to every single piece of information we see in this table?
I start by looking at the upper right section of the summary table (which is available to you by calling the .summary() method on your OLS object). Specifically, I look at the first four measures:
- R-squared. This is probably the most important measure you need to pay attention to as it captures the predictive power of your model. In our case, an R-squared of 0.244 tells us that the independent variables explain 24.4% of the variance in the dependent variable.
- Adjusted R-squared. R-squared by itself isn’t very helpful with multiple regression models, as it tends to go up the more features you add to your model. In this current example, try to dummy the zipcode variable and turn each zip code into a separate independent variable. You’ll add about 70 features to your model, improving your R-squared but not the model itself. Adjusted R-squared is the solution to this issue: it takes into account not just how many variables you add, but whether those variables actually improve the fit.
- The F-Statistic. This statistic gives us information about the coefficients of our regression at the macro level. If the coefficients of all your features were jointly equal to zero, the model would be useless, so the F-test checks the null hypothesis that they are. In our model, the F-statistic is 1829, which is high enough to reject that null hypothesis at the 5% level of significance.
- The Probability of the F-Statistic. This is the p-value associated with that F test. If the p-value is less than your level of significance, you can reject the null hypothesis. That is the case with our model.
Next, we are going to get a bit more granular and explore the coefficients for each variable. All necessary information for that is located in the middle section of the table:
At the very top of the variables section in our summary, we see the constant term (the Y-intercept). All that constant does is estimate the price when all other features in the model are equal to zero.
Below, we see each of our independent variables. The coefficient is the estimated effect of the predictor variable on the target variable: the slope of the regression with respect to that feature, holding the other features constant. Again, the coefficient needs to be non-zero; otherwise, there would be no relationship between the two variables. It’s okay if the coefficient is a negative number; that just means that with each one-unit increase in the feature, the value of the target decreases.
Let’s take a closer look at the coefficients, starting with sqft. It’s common sense that the price of a house will increase with its size; price per square foot is one of the most familiar metrics in real estate. As expected, we see a positive coefficient. It’s essential to keep the scale of our independent variable in mind to interpret the coefficient correctly: with an increase of 1 sqft, the house price increases by $119 on average.
Does the same logic apply to our third feature, zip code? Again, remember the context of your interpretation. Zip codes are not continuous; they are categorical, as I hinted above. If you switch from one zip code to another, you will see an increase in the price of about $327 on average. This doesn’t tell you much, does it? It’s best to reconsider the way you include zip code in your model. You can use dummies, which is a typical way of handling categorical variables. Another alternative is to express the value of zip codes through a newly engineered feature: for example, population density for each zip code or median household income for each zip code.
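The dummy approach can be sketched with pandas’ `get_dummies`. Note `drop_first=True`: one zip code must be dropped as the baseline, otherwise the dummy columns sum to the intercept column and the model becomes perfectly collinear (the “dummy-variable trap”):

```python
import pandas as pd

# Small synthetic frame with a categorical zipcode column
df = pd.DataFrame({
    "price":   [221900, 538000, 180000, 604000],
    "sqft":    [1180, 2570, 770, 1960],
    "zipcode": [98178, 98125, 98028, 98125],
})

# One dummy column per zip code; drop_first=True keeps one as the baseline
dummies = pd.get_dummies(df["zipcode"], prefix="zip", drop_first=True)
df_dummied = pd.concat([df.drop(columns="zipcode"), dummies], axis=1)
print(df_dummied.columns.tolist())
```

With the full King County data and its ~70 zip codes, this is exactly the step that inflates plain R-squared while adjusted R-squared keeps the feature count honest.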
Keep in mind, each coefficient represents the additional effect of adding that variable to the model, if the effects of all other variables in the model are already accounted for.
Lastly, the middle section’s right side generally tells you the same thing in different ways: how significant the feature is. The higher the t-statistic (the coefficient divided by its standard error), the more significant the variable. I first look at the p-values for each. The p-value gives you the probability of seeing a coefficient this large due to random chance. In other words, how likely is it that the population coefficient is zero, but due to bad luck with our sampling, we got this non-zero estimate? The rule of thumb here is: if the p-value for your feature is above 0.05, consider removing that feature from your model.
In the next part of this series, I’ll talk about the steps you can take to improve your model with your next iteration. We’ll check for multicollinearity, heteroscedasticity, and the normal distribution of our residuals. Based on our findings, we’ll discuss appropriate steps to resolve any potential issues in the next iteration.