Multicollinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be independent. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.
In this blog post, I'll highlight the problems that multicollinearity can cause, show you how to test your model for it, and highlight some ways to resolve it. In some cases, multicollinearity isn't necessarily a problem, and I'll show you how to make this determination. I'll work through an example dataset that contains multicollinearity to bring it all to life!
Why is Multicollinearity a Potential Problem?
A key goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable. The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each 1 unit change in an independent variable when you hold all of the other independent variables constant. That last portion is crucial for our discussion about multicollinearity.
The idea is that you can change the value of one independent variable and not the others. However, when independent variables are correlated, it indicates that changes in one variable are associated with shifts in another variable. The stronger the correlation, the more difficult it is to change one variable without changing another. It becomes difficult for the model to estimate the relationship between each independent variable and the dependent variable because the independent variables tend to change in unison.
There are two basic types of multicollinearity:
- Structural multicollinearity: This type occurs when we create a model term using other terms. In other words, it's a byproduct of the model that we specify rather than being present in the data itself. For example, if you square term X to model curvature, clearly there is a correlation between X and X².
- Data multicollinearity: This type of multicollinearity is present in the data itself rather than being an artifact of our model. Observational experiments are more likely to exhibit this kind of multicollinearity.
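As a quick illustration of the structural type (using a hypothetical strictly positive predictor, not data from this article), squaring a variable creates a term that is strongly correlated with the original:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, 500)     # a strictly positive predictor
x_sq = x ** 2                   # squared term added to model curvature
r = np.corrcoef(x, x_sq)[0, 1]  # correlation between X and X^2 is close to 1
```

Because x never changes sign, x and x² rise and fall together, so their correlation lands near 1. Centering x before squaring would cut this correlation dramatically.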
Related post: What are Independent and Dependent Variables?
What Problems Does Multicollinearity Cause?
Multicollinearity causes the following two basic types of problems:
- The coefficient estimates can swing wildly based on which other independent variables are in the model. The coefficients become very sensitive to small changes in the model.
- Multicollinearity reduces the precision of the estimated coefficients, which weakens the statistical power of your regression model. You might not be able to trust the p-values to identify independent variables that are statistically significant.
Imagine you fit a regression model and the coefficient values, and even the signs, change dramatically depending on the specific variables that you include in the model. It's a disconcerting feeling when slightly different models lead to very different conclusions. You don't feel like you know the actual effect of each variable!
Now, throw in the fact that you can't necessarily trust the p-values to select the independent variables to include in the model. This problem makes it difficult both to specify the correct model and to justify the model if many of your p-values are not statistically significant.
As the severity of the multicollinearity increases, so do these problematic effects. However, these problems affect only those independent variables that are correlated. You can have a model with severe multicollinearity and yet some variables in the model can be completely unaffected.
The regression example with multicollinearity that I work through later illustrates these problems in action.
Do I Have to Fix Multicollinearity?
Multicollinearity makes it difficult to interpret your coefficients, and it reduces the power of your model to identify independent variables that are statistically significant. These are definitely serious problems. However, the good news is that you don't always have to find a way to fix multicollinearity.
The need to reduce multicollinearity depends on its severity and your primary goal for your regression model. Keep the following three points in mind:
- The severity of the problems increases with the degree of the multicollinearity. Therefore, if you have only moderate multicollinearity, you may not need to resolve it.
- Multicollinearity affects only the specific independent variables that are correlated. Therefore, if multicollinearity is not present for the independent variables that you are particularly interested in, you may not need to resolve it. Suppose your model contains the experimental variables of interest and some control variables. If high multicollinearity exists for the control variables but not the experimental variables, then you can interpret the experimental variables without problems.
- Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, precision of the predictions, and the goodness-of-fit statistics. If your primary goal is to make predictions, and you don't need to understand the role of each independent variable, you don't need to reduce severe multicollinearity.
Over the years, I've found that many people are incredulous over the third point, so here's a reference!
The fact that some or all predictor variables are correlated among themselves does not, in general, inhibit our ability to obtain a good fit nor does it tend to affect inferences about mean responses or predictions of new observations. —Applied Linear Statistical Models, p. 289, 4th edition
If you're performing a designed experiment, it is likely orthogonal, meaning it has zero multicollinearity. Learn more about orthogonality.
Testing for Multicollinearity with Variance Inflation Factors (VIF)
If you can identify which variables are affected by multicollinearity and the strength of the correlation, you're well on your way to determining whether you need to fix it. Fortunately, there is a very simple test to assess multicollinearity in your regression model. The variance inflation factor (VIF) identifies correlation between independent variables and the strength of that correlation.
Statistical software calculates a VIF for each independent variable. VIFs start at 1 and have no upper limit. A value of 1 indicates that there is no correlation between this independent variable and any others. VIFs between 1 and 5 suggest that there is a moderate correlation, but it is not severe enough to warrant corrective measures. VIFs greater than 5 represent critical levels of multicollinearity where the coefficients are poorly estimated, and the p-values are questionable.
Use VIFs to identify correlations between variables and determine the strength of the relationships. Most statistical software can display VIFs for you. Assessing VIFs is particularly important for observational studies because these studies are more prone to having multicollinearity.
Multicollinearity Example: Predicting Bone Density in the Femur
This regression example uses a subset of variables that I collected for an experiment. In this example, I'll show you how to detect multicollinearity as well as illustrate its effects. I'll also show you how to remove structural multicollinearity. You can download the CSV data file: MulticollinearityExample.
I'll use regression analysis to model the relationship between the independent variables (physical activity, body fat percentage, weight, and the interaction between weight and body fat) and the dependent variable (bone mineral density of the femoral neck).
Here are the regression results:
These results show that Weight, Activity, and the interaction between them are statistically significant. The percent body fat is not statistically significant. However, the VIFs indicate that our model has severe multicollinearity for some of the independent variables.
Notice that Activity has a VIF near 1, which shows that multicollinearity does not affect it and we can trust this coefficient and p-value with no further action. However, the coefficients and p-values for the other terms are suspect!
Additionally, at least some of the multicollinearity in our model is the structural type. We've included the interaction term of body fat * weight. Clearly, there is a correlation between the interaction term and both of the main effect terms. The VIFs reflect these relationships.
I have a neat trick to show you. There's a method to remove this type of structural multicollinearity quickly and easily!
Center the Independent Variables to Reduce Structural Multicollinearity
In our model, the interaction term is at least partially responsible for the high VIFs. Both higher-order terms and interaction terms produce multicollinearity because these terms include the main effects. Centering the variables is a simple way to reduce structural multicollinearity.
Centering the variables is also known as standardizing the variables by subtracting the mean. This process involves calculating the mean for each continuous independent variable and then subtracting the mean from all observed values of that variable. Then, use these centered variables in your model. Most statistical software provides the feature of fitting your model using standardized variables.
There are other standardization methods, but the advantage of just subtracting the mean is that the interpretation of the coefficients remains the same. The coefficients continue to represent the mean change in the dependent variable given a 1 unit change in the independent variable.
In the worksheet, I’ve included the centered independent variables in the columns with an S added to the variable names.
For more about this, read my post about standardizing your continuous independent variables.
Regression with Centered Variables
Let's fit the same model but using the centered independent variables.
The most apparent difference is that the VIFs are all down to satisfactory values; they're all less than 5. By removing the structural multicollinearity, we can see that there is some multicollinearity in our data, but it is not severe enough to warrant further corrective measures.
Removing the structural multicollinearity produced other notable differences in the output that we’ll investigate.
Comparing Regression Models to Reveal Multicollinearity Effects
We can compare two versions of the same model, one with high multicollinearity and one without it. This comparison highlights its effects.
The first independent variable we'll look at is Activity. This variable was the only one to have almost no multicollinearity in the first model. Compare the Activity coefficients and p-values between the two models and you'll see that they are the same (coefficient = 0.000022, p-value = 0.003). This illustrates how only the variables that are highly correlated are affected by multicollinearity's problems.
Let's look at the variables that had high VIFs in the first model. The standard error of the coefficient measures the precision of the estimates. Lower values indicate more precise estimates. The standard errors in the second model are lower for both %Fat and Weight. Additionally, %Fat is significant in the second model even though it wasn't in the first model. Not only that, but the coefficient sign for %Fat has changed from positive to negative!
The lower precision, switched signs, and a lack of statistical significance are typical problems associated with multicollinearity.
Now, take a look at the Summary of Model tables for both models. You'll notice that the standard error of the regression (S), R-squared, adjusted R-squared, and predicted R-squared are all identical. As I mentioned earlier, multicollinearity doesn't affect the predictions or goodness-of-fit. If you just want to make predictions, the model with severe multicollinearity is just as good!
How to Deal with Multicollinearity
I showed how there are a variety of situations where you don't need to deal with it. The multicollinearity might not be severe, it might not affect the variables you're most interested in, or maybe you just need to make predictions. Or, perhaps it's just structural multicollinearity that you can get rid of by centering the variables.
But, what if you have severe multicollinearity in your data and you find that you must deal with it? What do you do then? Unfortunately, this situation can be difficult to resolve. There are a variety of methods that you can try, but each one has some drawbacks. You'll need to use your subject-area knowledge and factor in the goals of your study to pick the solution that provides the best mix of advantages and disadvantages.
The potential solutions include the following:
- Remove some of the highly correlated independent variables.
- Linearly combine the independent variables, such as adding them together.
- Perform an analysis designed for highly correlated variables, such as principal components analysis or partial least squares regression.
- LASSO and Ridge regression are advanced forms of regression analysis that can handle multicollinearity. If you know how to perform linear least squares regression, you'll be able to handle these analyses with just a little additional study.
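As a sketch of the ridge option (using scikit-learn and made-up, nearly collinear data): the L2 penalty shrinks the wildly varying part of the least squares solution, pulling the coefficients of nearly duplicate predictors toward each other.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly a copy of x1
y = x1 + x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # the penalty tames the unstable direction
# The two ridge coefficients sit much closer together than the OLS pair
```

The cost is bias: the ridge coefficients are deliberately shrunk, which is one of the drawbacks you weigh against the improved stability.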
As you consider a solution, remember that all of these have downsides. If you can accept less precise coefficients, or a regression model with a high R-squared but hardly any statistically significant variables, then not doing anything about the multicollinearity might be the best solution.
In this post, I use VIFs to check for multicollinearity. For a more in-depth look at this measure, read my post about Calculating and Assessing Variance Inflation Factors (VIFs).
If you're learning regression and like the approach I use in my blog, check out my Intuitive Guide to Regression Analysis book! You can find it on Amazon and other retailers.