Multicollinearity is an intrinsic, and probably desirable, characteristic of perceptual data. But if it is not handled correctly, multicollinearity can lead to large confidence intervals around a model's estimated parameters, and thus to results that are hard to interpret or simply not actionable.

Suppose you want to assess the influence of distance from home on the decision to shop in a given supermarket. A possibility would be to model being a client (a yes/no variable) as a function of the customer’s characteristics and that distance, in a logistic regression.

What would happen if we were to include in the model the distance measured both in meters and in miles?

Here are the explanatory variables of the model for the first four respondents:

| HH size | Children in HH | HH income | Market size | Distance in meters | Distance in miles | … |
|---|---|---|---|---|---|---|
| 2 | No | 2 | 5 | 150 | 0.093 | |
| 4 | Yes | 2 | 2 | 800 | 0.497 | |
| 1 | No | 1 | 3 | 50 | 0.031 | |
| 3 | Yes | 4 | 5 | 450 | 0.280 | |

The two distance columns are exactly proportional. This conflicts with the basic principle of regression, which is to give each variable its own weight in the decision to be a client. There is no way to give separate weights to the distance in meters and the distance in miles: they are actually the same variable.

In a regression, this translates into a cross-product matrix of the explanatory variables that cannot be inverted. We would need to delete one of the two distance variables in order to estimate the model.
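This can be seen directly in code. Here is a minimal numpy sketch, using hypothetical figures mirroring the table above: with the two exactly proportional distance columns, the design matrix loses a rank and the cross-product matrix X'X is numerically non-invertible.

```python
import numpy as np

# Hypothetical data mirroring the table above; the two distance columns
# are exactly proportional (1 mile is about 1609.34 meters).
meters  = np.array([150.0, 800.0, 50.0, 450.0])
miles   = meters / 1609.34
hh_size = np.array([2.0, 4.0, 1.0, 3.0])

# Design matrix: intercept, HH size, distance in meters, distance in miles.
X = np.column_stack([np.ones(4), hh_size, meters, miles])

# X has 4 columns but only rank 3: the miles column adds no information,
# so the cross-product matrix X'X is singular and has no usable inverse.
print(np.linalg.matrix_rank(X))        # 3, not 4
print(np.linalg.cond(X.T @ X) > 1e15)  # True: numerically non-invertible
```

Dropping either one of the two distance columns restores full rank and lets the regression be estimated.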

As soon as you work with customers' perceptions, variables are often correlated.

Suppose you are interested in clients' satisfaction with their bank. The clients gave a satisfaction rating on a scale from 1 to 10. Here are the correlations observed across the various rated dimensions:

| | Overall satisfaction | Satisfaction with the branch | Satisfaction with the account manager | Satisfaction with product offer | … |
|---|---|---|---|---|---|
| Overall satisfaction | 1.00 | 0.72 | 0.79 | 0.47 | |
| Satisfaction with the branch | 0.72 | 1.00 | 0.75 | 0.42 | |
| Satisfaction with the account manager | 0.79 | 0.75 | 1.00 | 0.38 | |
| Satisfaction with product offer | 0.47 | 0.42 | 0.38 | 1.00 | |
| … | | | | | |

Overall satisfaction is highly correlated with satisfaction with the branch and with satisfaction with the account manager: these two dimensions will be important drivers of overall satisfaction.

Satisfaction with the branch is also highly correlated with satisfaction with the account manager. And these two dimensions would be key explanatory variables in a driver model seeking to explain overall satisfaction.

For example, a simple model could be the following:

Overall satisfaction = a + b*satisfaction with the branch + c*satisfaction with the account manager + d*satisfaction with product offer + …

b and c are the weights of the branch and of the account manager in the construction of overall satisfaction. This type of model (in practice usually a bit more sophisticated) allows us to understand the drivers of satisfaction and thus to prioritise the actions needed to make clients more satisfied.

As satisfaction with the branch and satisfaction with the account manager are highly correlated, parameters b and c might be estimated with little precision. This is what is called multicollinearity.

If these two dimensions were perfectly correlated, as in the previous section, we could only estimate one parameter. We would be in the framework of exact collinearity, with a non-invertible cross-product matrix of the explanatory variables.

When two explanatory variables are highly correlated, their cross-product matrix is almost singular, so the elements of its inverse are very large, a little like the inverse of a number close to 0 being a large number. As the confidence intervals around the estimated parameters of a regression are proportional to the elements of this inverse matrix, they can be very large when explanatory variables are highly correlated.
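A small simulation illustrates this mechanism. The sketch below (variable names and figures are ours, not from any real survey) computes the classical standard error of a coefficient from var(β̂) = σ²(X'X)⁻¹ for two regressors with a chosen correlation; at a correlation of 0.95 the standard error is roughly three times what it would be with uncorrelated regressors.

```python
import numpy as np

def se_of_coefficient(rho, n=200, seed=0):
    """Classical standard error of the weight on x1 when x1 and x2
    have correlation rho, from var(beta_hat) = sigma^2 (X'X)^-1
    (the true sigma = 1 is assumed known, for brevity)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, 2))
    x1 = z[:, 0]
    x2 = rho * x1 + np.sqrt(1 - rho**2) * z[:, 1]   # corr(x1, x2) ~ rho
    X = np.column_stack([np.ones(n), x1, x2])
    return np.sqrt(np.linalg.inv(X.T @ X)[1, 1])

# Confidence intervals widen sharply as the correlation grows:
print(se_of_coefficient(0.0), se_of_coefficient(0.95), se_of_coefficient(0.99))
```

The inflation factor is 1/(1 − ρ²) on the variance, so it explodes as ρ approaches 1, which is exactly the "almost singular" situation described above.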

This is a very practical issue: knowing whether the branch is a more important satisfaction driver than the account manager, or as important, or less important, will lead to different CRM investments. And confidence intervals are there to help answer that question.

It is thus really important to understand how to handle this type of data.

Multicollinearity is often seen as a shameful disease that we should get rid of. That is not how SLPV analytics sees things. Why are perceptual data about a brand correlated? Precisely because they share a common dimension, and that dimension is quite often the brand itself. Without a brand effect, there would be no, or much less, multicollinearity.

If the data we collect were not multicollinear, large parts of marketing science would disappear. It is a good thing that our perceptual data are correlated; it would be very strange if they happened not to be.


But of course, we need to handle the possible lack of precision of the estimated parameters in a model where the data are highly correlated. It is then crucial to diagnose the problem correctly in order to choose the correct remedy.

What is the issue with multicollinearity? It is not a bias issue but a precision issue (see proof here). The two notions are really different (we discuss them here), and a basic principle of statistics is that there is a trade-off between the two.

What does Wikipedia, the wisdom of crowds, have to say on the subject? Ten solutions are suggested:

– Five solve nothing: (1) Check that there is no exact collinearity: this would be immediately visible on the regression output. (2) Check how the coefficients vary when the model is estimated on subsamples: useless, as the standard deviations of the coefficients already tell us this. (3) Leave things as they are. (6) and (7) Centre and standardise the explanatory variables: this changes nothing in the model.

– Three others propose to introduce a bias: (4) Delete variables from the model. (8) Use the Shapley Value. (9) Use Ridge regression or run a PCA of the explanatory variables.

– Two effectively propose to increase the precision of the model, without necessarily introducing a bias: (5) Obtain more data. (10) In the specific case where the model is estimated on time series with lagged values of the explanatory variables, impose a structure on the coefficients of the lagged variables.

This list of solutions, and its ranking, is characteristic of the confusion on this subject among econometrics practitioners.

To our mind, introducing a bias into a model to solve a precision problem is like cutting off the head of a migraine-prone patient. While statistical tools are available to measure the precision of a model, there are, by definition, none to measure bias. A large bias might lead to non-actionable, even damaging, recommendations. It should be noted, however, that this type of solution has numerous followers, with Ridge regression and, even worse in our eyes, the Shapley Value.

We suggest two other solutions:

– Obtaining more data is the best answer to the issue; we show here that it does indeed solve it. This solution is more actionable than you might first think. Modelling on perceptual data often happens in a barometric context, and brand preference drivers, satisfaction drivers, etc. do not vary significantly over a few months, or even over years in some sectors. Cumulating the waves of a barometer to get more data, and thus more precision in the modelling, is an excellent answer to the challenges raised by multicollinearity.
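The effect of cumulating waves can be sketched in a few lines of numpy. In the entirely simulated example below, quadrupling the sample from 300 to 1,200 respondents roughly halves the standard error, as the 1/√n rule predicts, even though the two drivers remain correlated at 0.9.

```python
import numpy as np

def se_of_driver(n, rho=0.9, seed=1):
    """Standard error of one driver's weight, with two drivers correlated
    at rho (simulated data; the true sigma = 1 is assumed known)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, 2))
    x1 = z[:, 0]
    x2 = rho * x1 + np.sqrt(1 - rho**2) * z[:, 1]
    X = np.column_stack([np.ones(n), x1, x2])
    return np.sqrt(np.linalg.inv(X.T @ X)[1, 1])

# Cumulating four waves of 300 respondents roughly halves the standard error:
print(se_of_driver(300), se_of_driver(1200))
```

More data does not remove the correlation between the drivers; it simply buys back the precision that the correlation took away, without introducing any bias.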

– Reducing the number of estimated parameters in the model, by testing coefficients for equality. This is similar to the solution suggested by Wikipedia for the particular case of time-series data. By imposing constraints on the coefficients, the dimensionality of the model is reduced, and experience shows that the confidence intervals around the coefficients shrink.
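As an illustration, the simulated sketch below (names like `branch` and `manager` are ours) compares the unconstrained satisfaction model with a constrained version where the equality b = c is imposed by regressing on the sum of the two correlated drivers; the common coefficient is estimated far more precisely.

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho = 300, 0.9
z = rng.standard_normal((n, 2))
branch  = z[:, 0]
manager = rho * branch + np.sqrt(1 - rho**2) * z[:, 1]   # correlation ~ 0.9
# The true model has equal weights, so the constraint b = c is legitimate here.
y = 3.0 + 0.5 * branch + 0.5 * manager + rng.standard_normal(n)

# Unconstrained model: separate weights b and c (sigma = 1 assumed known).
X_free = np.column_stack([np.ones(n), branch, manager])
se_b = np.sqrt(np.linalg.inv(X_free.T @ X_free)[1, 1])

# Constrained model: impose b = c by regressing on the sum of the drivers.
X_eq = np.column_stack([np.ones(n), branch + manager])
se_common = np.sqrt(np.linalg.inv(X_eq.T @ X_eq)[1, 1])

b_common = np.linalg.lstsq(X_eq, y, rcond=None)[0][1]
print(se_b, se_common, b_common)  # the constrained weight is much more precise
```

In practice the equality should first be tested (for example with an F-test on the restriction) before being imposed; the point of the sketch is only the shrinkage of the confidence interval.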

You would like to understand the preference drivers of internet access providers. In order to do that, you have collected clients' perceptions on around thirty dimensions, covering product offer, client service, support, tariffs and brand perception.

Then you estimate a model linking preference for a provider to perceptions about that provider and its competitors. Perceptions across all dimensions are highly correlated, and a first model linking preferences to all dimensions has led to inconclusive results, with large confidence intervals around the estimated parameters.

Which one of the following options do you prefer as a next stage?


To the amazement of the author of this article, the subject of multicollinearity has fascinated numerous authors: theoretical as well as practical papers on the subject are in large supply, much more so than papers on more fundamental subjects like endogeneity. Any well-behaved econometrics handbook will thus have a section on the subject. A classical reference is:

A.S. Goldberger (1964): Econometric Theory. John Wiley & Sons.

or

A.S. Goldberger (1991): A Course in Econometrics. Harvard University Press.

Unless I am mistaken, the first paper proposing to use the Shapley Value for dealing with multicollinearity is:

Conklin, M., Powaga, K., Lipovetsky, S. (2004): Customer Satisfaction Analysis: Identification of Key Drivers. European Journal of Operational Research, 154(3), 819-827.