Endogeneity is a key statistical concept: it determines whether the results calculated by a statistician are unbiased or not. The link between low voter turnout and the score of the National Front, the hierarchy of the drivers of client satisfaction, the impact of the number of screens on the first-week revenue of a new movie: none of these issues can be properly analysed without taking into account possible endogeneity in the data.

**A simple conditional model**

Suppose we want to predict the revenue generated by a new movie in its opening week. We have data on a sample of movies, covering both revenue and number of screens for that first week. A simple model would state that revenue is linearly linked to the number of screens:

revenue = a + b* (number of screens)

We want to estimate a and b. Once these parameters have been estimated, we will be able to forecast the revenue of any new movie for which we know the number of screens allocated by distributors. Parameters a and b can be estimated with a linear regression :

revenue = a + b* (number of screens) + u

u stands for everything that was not put into the model. In the language of modellers, this is a disturbance, or a residual. a and b are estimated by minimising the sum of the squared u’s. Of course, this model is a simplified one:

– The relationship between the two variables is probably not linear. We could write the model with the logarithm of the two variables, or any other transformation.

– Other characteristics most probably impact revenue: country of origin, type of movie, advertising expenditure, reviews, word of mouth…

At this stage, the added complexity is not needed to understand the crux of the matter. There is one crucial feature of this model: it focuses on the relationship between revenue and number of screens (and maybe other variables). We don’t care about the process that led to the screen allocation: this process is not in our model and is actually irrelevant. The number of screens is a given, and we estimate revenue conditionally on the number of screens. Our model is thus a “conditional” model.
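The least-squares estimation of a and b described above can be sketched in a few lines. This is only an illustration on simulated data: the sample, the "true" values a = 2 and b = 0.01 and the helper name `fit_ols` are assumptions made for the example, not figures from any real movie dataset.

```python
import numpy as np

def fit_ols(x, y):
    """Return (a, b) minimising the sum of squared residuals of y = a + b*x + u."""
    X = np.column_stack([np.ones_like(x, dtype=float), x])  # constant + regressor
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]

# Hypothetical sample: number of screens and opening-week revenue for 500 movies
rng = np.random.default_rng(0)
screens = rng.uniform(100, 3000, size=500)
revenue = 2.0 + 0.01 * screens + rng.normal(0.0, 1.0, size=500)  # assumed a = 2, b = 0.01

a_hat, b_hat = fit_ols(screens, revenue)
print(f"estimated a = {a_hat:.3f}, estimated b = {b_hat:.4f}")
```

In this simulation the number of screens is drawn independently of the disturbance u, which is exactly the situation in which the conditional model can be estimated safely.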

**Things get complicated**

Let us now step up to a grander modelling design. We are now interested in the process that produced both revenue **and** number of screens. We observe simultaneously two variables (revenue, number of screens) and we model together these two variables as a function of other characteristics of the movie. These could be the ones mentioned above, that we will collectively name Z. We are interested in the probability of observing a value (revenue, number of screens) as a function of Z :

P_{Z}(revenue, number of screens)

The Bayes decomposition formula tells us that:

P_{Z}(revenue, number of screens) = P_{Z}(revenue | number of screens) P_{Z}(number of screens)

This means that the probability of the bivariate variable (revenue, number of screens) is equal to the product of:

– The conditional probability of the revenue, given the number of screens,

– And the probability of the number of screens.

The first part of this formula, P_{Z}(revenue | number of screens), is precisely our initial model. We have modelled that conditional probability as a function of two parameters a and b. The previous equation can thus be rewritten:

P_{Z}(revenue, number of screens) = P(revenue | number of screens, a, b) P_{Z}(number of screens)

Let us focus now on the model for the number of screens. That number is going to be decided by the movie distributor, based on various criteria: country of origin, type of movie, advertising expenditure, reviews, word of mouth… In short, our Z variables. But not only those: the distributor will probably also factor in the movie’s anticipated revenue. Our model for (revenue, number of screens) becomes:

revenue = a + b* (number of screens) + c * Z + u

number of screens = f + d* (revenue) + e * Z + v

By replacing revenue in the second equation with its expression from the first one, and assuming d*b is different from 1, we get:

number of screens = (f + d*a)/(1 – d*b) + (d*c + e)/(1 – d*b) * Z + w

where w = (d*u + v)/(1 – d*b) combines the disturbances of the two equations. Our Bayes formula for the bivariate variable (revenue, number of screens) thus becomes:

P(revenue, number of screens | Z) = P(revenue | number of screens, Z, a, b, c) P(number of screens | Z, a, b, c, d, e, f)

This is the core of the issue. We are mainly interested in parameter b, which gives the sensitivity of revenue to the number of screens (an elasticity when the model is written in logarithms). The larger this coefficient, the more profitable an additional screen: it is thus crucial to estimate it properly. We thought we could quietly estimate it from the conditional model of revenue, given the number of screens. The previous equation shows that, if we do that, we set aside some of the information that the available data give us on parameter b: we don’t take into account the information on b that is brought by observing the relationship between the number of screens and the variables Z. In statistical jargon, the number of screens is not exogenous for parameter b. Put more simply, the variable number of screens is endogenous in the model:

revenue= a + b* (number of screens) + c * Z + u
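A small simulation can make this endogeneity tangible. The sketch below generates data from a two-equation system of the kind discussed above and then estimates the revenue equation alone by least squares. All parameter values are hypothetical, and the intercept of the screens equation is written `f` here so that it cannot be confused with `c`, the coefficient of Z in the revenue equation.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical structural parameters (chosen for illustration only)
a, b, c = 1.0, 0.5, 1.0   # revenue = a + b*screens + c*Z + u
f, d, e = 2.0, 0.4, 1.0   # screens = f + d*revenue + e*Z + v

Z = rng.normal(size=n)    # observed movie characteristic
u = rng.normal(size=n)    # disturbance of the revenue equation
v = rng.normal(size=n)    # disturbance of the screens equation

# Solve the two equations jointly (reduced form), which requires d*b != 1
screens = ((f + d * a) + (d * c + e) * Z + d * u + v) / (1.0 - d * b)
revenue = a + b * screens + c * Z + u

# Conditional model estimated on its own: regress revenue on screens and Z
X = np.column_stack([np.ones(n), screens, Z])
coef, *_ = np.linalg.lstsq(X, revenue, rcond=None)
b_ols = coef[1]

# Because screens depends on revenue, it is correlated with u
corr_su = np.corrcoef(screens, u)[0, 1]

print(f"true b = {b}, least-squares estimate = {b_ols:.3f}")
print(f"correlation between screens and u = {corr_su:.3f}")
```

Under these assumed values the least-squares estimate settles well above the true b, however large the sample: the feedback from revenue to screens (d different from 0) is what ties the regressor to the disturbance.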

**Consequences of endogeneity**

When a variable is endogenous in a conditional model, the careless modeller does not use all the information at hand to estimate the parameters of the model. This has two possible consequences:

– The estimated parameters are less precise than what would be obtained if the full information had been used: the associated standard errors are larger.

– Much worse, the estimated parameters might be biased.

It is quite frequent that an endogeneity issue will generate a bias in the estimation of a model. This bias can be pretty large and thus lead to wrong decisions in terms of economic, marketing or industrial policy, as we will see in the next section.

Before that, would the reader allow us to indulge in a few personal thoughts? Endogeneity is a crucial issue as soon as data modelling is involved. One of the founding papers on the topic, Exogeneity, was published in Econometrica in 1983. One of the authors, Robert Engle, was awarded the Nobel prize in economics in 2003: a large part of his work has been devoted to endogeneity and its consequences. It is quite striking that this central question has been largely ignored in marketing research practice, while minor issues like multicollinearity have received so much attention. Nobody ever got the Nobel prize for studying multicollinearity… A last point: endogeneity is a complex concept, which non-statisticians will not grasp easily. It is however crucial to deal with it properly if we want to take the right data-driven decisions. One more example of the raison d’être of SLPV analytics: “Nothing is more practical than a good theory” (Vapnik, preface to The Nature of Statistical Learning Theory).

**Three examples**

**Drivers of movies’ revenue**

Our simplified model above was inspired by the article by Anita Elberse and Jehoshua Eliashberg, published in 2003 in Marketing Science: “Demand and Supply Dynamics for Sequentially Released Products in International Markets: The Case of Motion Pictures”. The authors simultaneously model the revenue generated by movies in their opening week and the number of screens allocated by distributors. The table below is an extract from their results:

Model for the logarithm of the revenue (standard errors in parentheses)

| | Without taking into account endogeneity | Taking into account endogeneity |
|---|---|---|
| Logarithm of number of screens | 0.74 (0.03) | 0.81 (0.04) |
| Logarithm of advertising expenses | 0.58 (0.07) | 0.20 (0.07) |
| Logarithm of average review rating | 0.55 (0.01) | 0.75 (0.03) |

As the variables enter the model in logarithms, the estimated coefficients can be read directly as elasticities: an increase of 1% in the number of screens would lead to an increase of 0.74% or 0.81% in revenue, depending on the estimation method. As has been shown before, the number of screens is an endogenous variable in this model. Estimating the model without taking this endogeneity into account leads to a wrong diagnosis on two key variables: the impact of advertising expenses is overestimated (the elasticity is 0.20 and not 0.58), and the impact of reviews is underestimated (the elasticity is 0.75 and not 0.55).
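To make the elasticity reading concrete, the short sketch below converts a coefficient of a log-log model into the implied relative change in revenue. The function name is hypothetical; only the two coefficients for the number of screens are taken from the table. It also shows that the "1% gives b%" reading is a small-change approximation.

```python
def revenue_change(elasticity, screens_change):
    """Relative change in revenue implied by log(revenue) = ... + elasticity * log(screens)."""
    return (1.0 + screens_change) ** elasticity - 1.0

# Coefficients of the logarithm of the number of screens, from the table
for label, beta in [("ignoring endogeneity", 0.74), ("accounting for endogeneity", 0.81)]:
    small = revenue_change(beta, 0.01)   # 1% more screens
    large = revenue_change(beta, 0.10)   # 10% more screens
    print(f"{label}: +1% screens -> {small:.3%} revenue, +10% screens -> {large:.2%} revenue")
```

For a 1% change the result is very close to the coefficient itself; for larger changes the exact formula gives a slightly smaller figure than a naive multiplication.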

**Impact of the 35 hours week on productivity**

Our second example comes from an ENSAE econometrics lecture, which is freely available on the Web. Its author is Bruno Crépon. One of his applied examples focuses on the impact of the working week reduction on production, when production factors (number of people and assets of the company) are unchanged. Here are his results:

| | Without taking into account endogeneity | Taking into account endogeneity |
|---|---|---|
| Impact of the working week reduction | -0.036 (0.003) | -0.161 (0.039) |

Estimating the model without taking endogeneity into account would tell us that production decreased by only 3.6% after the working week was shortened to 35 hours, which would imply a significant increase in productivity (as working hours decreased by 10.3%). Once the bias due to the endogeneity of the working week reduction is removed, the model tells us that production decreased by 16.1% with production factors unchanged, which means a decrease in productivity.

**Link between voter turnout and score of the National Front**

Our last example is taken from our article on the link between voter turnout and the vote share of the National Front in the 2012 French parliamentary elections. The article explains in detail how this was done. Let us focus on the impact of two variables on the score of the National Front:

| | Without taking into account endogeneity | Taking into account endogeneity |
|---|---|---|
| Voter turnout – small towns | 0.078 (0.005) | 0.450 (0.022) |
| Voter turnout – other towns | 0.192 (0.009) | 0.517 (0.021) |
| Median income per household member – small towns | 0.171 (0.065) | -0.653 (0.086) |
| Median income per household member – large towns | -1.211 (0.184) | -1.806 (0.228) |

When endogeneity is not taken into account, the model finds a small positive association between the National Front score and voter turnout: 1% more turnout means about 0.08% (in small towns) or 0.19% (in other towns) more for the FN. Once endogeneity is accounted for, these figures become roughly 0.45% and 0.5%, thus a much stronger link between turnout and that party’s score. The bias is even more spectacular when looking at the impact of voters’ income on the National Front score. The biased parameters, which do not take endogeneity into account, would indicate a positive effect in small towns (when median income increases, the vote share of the National Front would increase), whereas the effect is actually significantly negative.

The reader can see, from these three examples, the major diagnostic mistakes that would result from an ill-chosen estimation strategy and from neglecting the endogeneity issue.

**Impact of taking into account endogeneity on the precision of estimates**

As usual with statistics, you cannot have your cake and eat it too: the bias/precision trade-off is always there. Eliminating the bias translates into less precise estimates. This can be seen particularly in examples 2 and 3 above. In the case of the working week reduction model, the standard error of the coefficient is multiplied by more than 10 (from 0.003 to 0.039). Since a standard error shrinks with the square root of the sample size, this is equivalent to dividing the sample size by more than 100!

In the FN model, the standard errors of the coefficients for turnout are multiplied by roughly 2 to 4.
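The sample-size equivalence quoted above rests on the fact that, other things being equal, a standard error is proportional to one over the square root of the sample size. A minimal sketch of that arithmetic, using the standard errors reported in the tables (the helper name is hypothetical):

```python
def sample_size_factor(se_before, se_after):
    """Factor by which the sample would have to shrink for the standard error
    to grow from se_before to se_after, assuming se is proportional to 1/sqrt(n)."""
    return (se_after / se_before) ** 2

working_week = sample_size_factor(0.003, 0.039)    # working week reduction
turnout_small = sample_size_factor(0.005, 0.022)   # turnout, small towns
turnout_other = sample_size_factor(0.009, 0.021)   # turnout, other towns

print(f"working week: sample divided by about {working_week:.0f}")
print(f"turnout, small towns: sample divided by about {turnout_small:.1f}")
print(f"turnout, other towns: sample divided by about {turnout_other:.1f}")
```

A ratio of exactly 10 between the two standard errors gives the factor of 100 mentioned in the text; the figures in the table correspond to a ratio of 13, hence an even larger loss of effective sample size.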

**Endogeneity in the linear model**

The three examples above are in the framework of linear regression. In that specific case, there is a simple necessary and sufficient condition for an explanatory variable to be exogenous. A linear model can be written as:

y = a + b*x + u

u is the residual: this is everything we did not put in the model. The variable x is exogenous if it is not correlated with the residual. Statisticians will say, more formally, that the conditional expectation of u given x is equal to 0: E(u|x)=0.
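One practical caveat is worth illustrating: the u in this condition is the true, unobserved disturbance, not the residual computed after a least-squares fit. Fitted residuals are uncorrelated with x by construction, whether or not x is exogenous. The sketch below shows this on simulated data in which x is deliberately made endogenous; the data-generating process and all values are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

common = rng.normal(size=n)        # hypothetical unobserved factor driving both x and u
u = common + rng.normal(size=n)    # true disturbance
x = common + rng.normal(size=n)    # regressor correlated with u, hence endogenous
y = 1.0 + 2.0 * x + u              # true b = 2

X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef               # fitted least-squares residuals

corr_true = np.corrcoef(x, u)[0, 1]        # correlation with the true disturbance
corr_resid = np.corrcoef(x, resid)[0, 1]   # correlation with the fitted residuals

print(f"estimated b = {coef[1]:.3f} (true value 2)")
print(f"corr(x, true u) = {corr_true:.3f}")
print(f"corr(x, fitted residual) = {corr_resid:.2e}")
```

The estimate of b is biased upwards, yet the fitted residuals look perfectly "clean". This is why testing exogeneity requires additional information, typically in the form of instrumental variables.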

This is interesting for two reasons:

– It points to a simple way of testing whether a variable is exogenous or not,

– It also hints at how to estimate a model with endogenous explanatory variables. The three models discussed above were estimated with the instrumental variables method: the model is estimated with the help of new variables correlated with x, but not correlated with u. Its most common implementation is known as two-stage least squares.
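Both points can be sketched on simulated data. The example below is a minimal illustration under stated assumptions, not the procedure used in the three studies above: the instrument, the data-generating process and the parameter values are all hypothetical. It computes a two-stage least squares estimate by hand, then runs a regression-based exogeneity check in the spirit of Hausman's specification test, in which the first-stage residual is added to the original regression.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(7)
n = 50_000
const = np.ones(n)

# Simulated data (all values hypothetical)
instrument = rng.normal(size=n)                # correlated with x, independent of u
common = rng.normal(size=n)                    # unobserved factor shared by x and u
u = common + rng.normal(size=n)                # true disturbance
x = instrument + common + rng.normal(size=n)   # endogenous regressor
y = 1.0 + 2.0 * x + u                          # true b = 2

# Naive least squares, ignoring endogeneity
b_ols = ols(np.column_stack([const, x]), y)[1]

# Stage 1: project x on the instrument
W = np.column_stack([const, instrument])
x_hat = W @ ols(W, x)

# Stage 2: regress y on the projected x
b_iv = ols(np.column_stack([const, x_hat]), y)[1]

# Exogeneity check: add the first-stage residual to the original regression.
# A coefficient clearly different from zero signals that x is endogenous.
v_hat = x - x_hat
gamma = ols(np.column_stack([const, x, v_hat]), y)[2]

print(f"least squares: b = {b_ols:.3f}")
print(f"two-stage least squares: b = {b_iv:.3f}")
print(f"coefficient on first-stage residual = {gamma:.3f}")
```

Note that the standard errors produced by a naive second-stage regression are not valid, because they ignore the fact that x_hat is itself estimated; dedicated instrumental-variables routines apply the necessary correction. A formal version of the check would also compare gamma with its standard error rather than simply inspect its value.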



**References**

All econometrics textbooks deal with endogeneity. The details of the Econometrica paper are:

R. Engle, D. Hendry, J. Richard (1983): Exogeneity – Econometrica, Vol. 51, pp. 277–304.

Another seminal paper for understanding the statistical testing of endogeneity is:

J. A. Hausman (1978): Specification Tests in Econometrics – Econometrica, Vol. 46, No. 6, pp. 1251–1271.

And here are the references for the examples we used in this article:

B. Crépon (2005): Econométrie linéaire – lecture notes, CREST.

A. Elberse, J. Eliashberg (2003): Demand and Supply Dynamics for Sequentially Released Products in International Markets: The Case of Motion Pictures – Marketing Science, Vol. 22, No. 3, pp. 329–354.