A statistical result is always assessed along two dimensions: bias and precision. These two dimensions are independent. An estimate can be unbiased and precise, unbiased and imprecise, biased and precise, or biased and imprecise.

Bias is defined with respect to a true value that you seek to estimate.

The number of votes collected by a politician, the number of clients who are going to buy the new iPhone, the number of people reading a newspaper, the number of clients who will give a service the top satisfaction rating: all these quantities could be measured exactly if we were to conduct a census, i.e. interview the whole population. We would then know their true value.

Surveys based on samples allow us to calculate an estimate of these quantities.

A survey, and the related estimate, is unbiased if the estimate gets ever closer to the true value as the sample size grows.

Bias is defined as the difference between the true value and the asymptotic value of the estimate, i.e. the value the estimate settles on as the sample size grows.

The bias of an estimate depends on the way the data are collected: sampling, questionnaire, data collection mode, and so on.
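To make this concrete, here is a minimal simulation sketch of one possible source of bias: unequal response rates. All figures are hypothetical assumptions chosen for illustration (a true share of 40%, supporters answering half the time, non-supporters 80% of the time), and the function names are not part of any library. The point is only that the estimate settles on a value other than the true one, however large the sample.

```python
import random


def survey_estimate(n, true_share=0.40, p_respond_yes=0.5, p_respond_no=0.8, rng=None):
    """Estimate a share from n contacted people when response rates differ by answer.

    All default values are hypothetical, chosen only for illustration.
    """
    rng = rng or random.Random()
    yes = no = 0
    for _ in range(n):
        supports = rng.random() < true_share
        p_respond = p_respond_yes if supports else p_respond_no
        if rng.random() < p_respond:  # the person agrees to answer
            if supports:
                yes += 1
            else:
                no += 1
    respondents = yes + no
    return yes / respondents if respondents else float("nan")


def asymptotic_value(true_share=0.40, p_respond_yes=0.5, p_respond_no=0.8):
    """Value the estimate converges to as the number of contacts grows."""
    answered_yes = true_share * p_respond_yes
    answered_no = (1 - true_share) * p_respond_no
    return answered_yes / (answered_yes + answered_no)


rng = random.Random(0)
for n in (100, 10_000, 200_000):
    print(n, round(survey_estimate(n, rng=rng), 3))

# Sign convention used here: asymptotic value minus true value.
print("asymptotic value:", round(asymptotic_value(), 3))   # 0.294
print("bias:", round(asymptotic_value() - 0.40, 3))        # -0.106
```

Increasing `n` makes the estimate more stable, but it stabilises around 0.294 rather than 0.40: a larger sample does not remove a bias that comes from the way the data are collected.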


With one survey, you can calculate one estimate of the true value.

If you were to run several surveys under the same conditions but with different samples, you would obtain several estimates. Of course, research agencies would never do that, because they would obtain a different estimate for each sample, which is very difficult to communicate. However, the situation where two different estimates of the same quantity are communicated to the general public is not that rare: take the case of political polling done by two different research agencies at the same time.

Precision measures the dispersion of different estimates calculated on different samples collected in the same way.

Variance is the inverse of precision: the smaller the variance, the higher the precision.

The precision of an estimate depends mainly on sample size.
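As a rough guide, and assuming simple random sampling (an assumption made here for illustration), the link between sample size and precision for an estimated proportion can be written as follows, where $p$ is the true proportion, $\hat{p}$ its estimate and $n$ the sample size:

$$
\operatorname{Var}(\hat{p}) = \frac{p(1-p)}{n}, \qquad \operatorname{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}
$$

With $p = 0.5$, the standard error is about 0.050 for $n = 100$, 0.022 for $n = 500$ and 0.016 for $n = 1000$. Under this assumption, the sample size has to be quadrupled to halve the standard error.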

In the chart below, the **|** marks the true value you seek to estimate, for example the score of a presidential candidate. Each dot represents one estimate, calculated on a sample of 100 interviews. There are 40 different estimates:

Here is the same chart, still with 40 estimates, but with 500 and 1000 interviews respectively:

All estimates are unbiased: the larger the sample size, the closer they are to the true value. They also become more and more precise as the sample size increases: the 40 estimates are closer to each other.
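The experiment behind such charts can be sketched in a few lines. This is an illustrative simulation, not the code used to draw the charts: the true score of 50%, the random seed and the function name `simulate_estimates` are assumptions.

```python
import random
import statistics


def simulate_estimates(sample_size, n_surveys=40, true_share=0.5, seed=0):
    """Run n_surveys independent surveys and return one estimated share per survey."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_surveys):
        # Each interview is a random draw: the respondent supports the candidate or not.
        hits = sum(rng.random() < true_share for _ in range(sample_size))
        estimates.append(hits / sample_size)
    return estimates


for n in (100, 500, 1000):
    est = simulate_estimates(n)
    print(
        f"n={n:>4}  mean of 40 estimates={statistics.mean(est):.3f}  "
        f"dispersion (std dev)={statistics.stdev(est):.3f}"
    )
```

In each case the mean of the 40 estimates stays close to the true value, while their dispersion shrinks as the number of interviews grows.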

A basic statistical result says that you cannot have everything at the same time, i.e. an estimate with both minimal bias and maximal precision. A trade-off has to be made between the two dimensions. Classically, among the unbiased estimates, there is one with maximum precision. But some biased estimates are more precise still.

Here are two different types of estimates. The first group is unbiased, the second is biased. But the second is more precise than the first. Which one do you prefer?
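One common way to weigh the two dimensions against each other is the mean squared error, which equals the squared bias plus the variance. The sketch below compares two hypothetical survey designs under that criterion. Every figure is an assumption made for illustration: a true share of 45%, an unbiased design A with 100 interviews, and a design B whose estimates centre on 47% but rely on 2000 interviews. The variance formula assumes simple random sampling.

```python
def proportion_variance(center, n):
    """Variance of an estimated share centred on `center`, with n interviews."""
    return center * (1 - center) / n


def design_mse(center, n, true_share):
    """Mean squared error = squared bias + variance."""
    bias = center - true_share
    return bias ** 2 + proportion_variance(center, n)


TRUE_SHARE = 0.45  # hypothetical true value

designs = {
    "A (unbiased, n=100)": (0.45, 100),
    "B (biased, n=2000)": (0.47, 2000),
}

for name, (center, n) in designs.items():
    bias = center - TRUE_SHARE
    variance = proportion_variance(center, n)
    mse = design_mse(center, n, TRUE_SHARE)
    print(f"{name}: bias={bias:+.3f}  variance={variance:.6f}  MSE={mse:.6f}")
```

With these figures, design B has the lower mean squared error despite its bias. The ranking is not general: with a bias of six points instead of two, design A would come out ahead.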


- Confidence intervals
- Multicollinearity
- Shapley value
- Linear regression
- Precision of political polling
- Variance

