Statistical tests turn raw statistical material into actionable conclusions. They are at the origin of any data-driven decision making.

Statistical tests and confidence intervals are two sides of the same coin: in order to perform a test, you need to calculate a confidence interval. It is always useful to bring a test problem back to a confidence interval problem, because the latter is easier to grasp (it is thus better to read the article on confidence intervals before this one).

**Two examples**

Let us come back to the example of the alleged increase in French households’ confidence in March 2014 – see the article on confidence intervals. We show in that article how to calculate a confidence interval around the measured increase. The question is: is there indeed an increase, or is it impossible to tell, because of the random sampling error?

In order to answer that question, we can calculate a confidence interval around the measured variation (3 points, in that case). Doing the test is then straightforward: if 0 is in the confidence interval, we cannot reject the hypothesis that households’ confidence remained stable. If the confidence interval is indeed [-0,92 ; 6,92], as we are suggesting, then stating that households’ confidence increased in March 2014 is speculation.
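The whole test fits in a few lines. A minimal sketch in Python, assuming a standard error of 2 points for the measured change (the value implied by the interval quoted above, since 3 ± 1,96 × 2 = [-0,92 ; 6,92]):

```python
# Sketch of the test via a confidence interval, for the household
# confidence example. The standard error of 2.0 points is an assumption,
# chosen so that 3 +/- 1.96 * 2.0 reproduces the interval quoted above.
measured_change = 3.0      # points
standard_error = 2.0       # assumed, not taken from the survey itself

lower = measured_change - 1.96 * standard_error
upper = measured_change + 1.96 * standard_error
print(f"95% CI: [{lower:.2f} ; {upper:.2f}]")   # [-0.92 ; 6.92]

# The test: if 0 lies in the interval, we cannot reject stability.
significant_increase = lower > 0
print("Significant increase?", significant_increase)
```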

Another example: if a poll gives a 52% score to a candidate, vs 48% to his/her opponent, on the eve of an election, how can we tell whether the difference between the two candidates is significant? Each percentage is an estimate and each is associated with a confidence interval. There are two ways to proceed:

– Check whether 50% is in the confidence interval around 52%. This is equivalent to checking whether 50% is in the confidence interval around 48%: the two confidence intervals have the same width,

– Calculate the difference between the two percentages (52%-48%), calculate the confidence interval around that difference and check whether 0 is in the confidence interval. If it is, we cannot reject the hypothesis that the two candidates actually have the same score (50%) and thus that the poll does not bring any information.

The two solutions lead to the same conclusion. The second one (computing the difference between the two percentages) might seem more complex to implement, but it is more generic. This is how any test between two percentages is done: are two scores from two successive waves of a barometer identical? Are the purchase intents for two products different? Do two advertising campaigns have the same memory impact?

With 1000 interviews, which is the standard sample size for electoral polls, the confidence interval around the 52%-48%=4% difference is [-0,6% ; +8,4%]. Thus, 0 is in the confidence interval and it is not possible to reject the hypothesis that the two candidates are actually on a par. The poll would then bring no information. As discussed in the article on electoral polls, we think that this calculation does not apply here and that electoral polls are more precise than what standard variance calculations suggest.
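This difference test can be sketched as follows. The variance formula is an assumption: we treat the two estimated percentages as independent, which is a common simplification, so the resulting interval is close to, but not exactly, the one quoted above:

```python
import math

# Sketch of the two-percentage test from the poll example. We use a
# simple approximation that treats the two estimates as independent;
# the exact variance depends on the sampling design, so the interval
# differs slightly from the one quoted in the text.
n = 1000
p_a, p_b = 0.52, 0.48
diff = p_a - p_b                       # 4 points

se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
lower, upper = diff - 1.96 * se, diff + 1.96 * se
print(f"95% CI for the difference: [{lower:.1%} ; {upper:.1%}]")

# 0 is inside the interval: we cannot reject equal scores.
print("Cannot reject a tie:", lower <= 0 <= upper)
```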

**Test Statistic**

The outcome of a test can be summarised in a test statistic. In the same way that a confidence interval will always be calculated as:

point estimation +/- 1,96 standard error

a test statistic will always be equal to:

point estimation/standard error

And the hypothesis we test will be rejected at the 95% confidence level if the test statistic is larger than 1,96 in absolute value.
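As a sketch, here is that test statistic for the hypothesis “the true score is 50%” on the poll example of section 1 (the binomial standard error is the usual approximation, not something specific to this article):

```python
import math

# Test statistic for the hypothesis "true score = 50%", from a poll
# giving 52% on n = 1000 respondents. The binomial standard error is
# the usual normal approximation.
n, p_hat, p0 = 1000, 0.52, 0.50

se = math.sqrt(p_hat * (1 - p_hat) / n)
test_statistic = (p_hat - p0) / se
print(f"test statistic = {test_statistic:.2f}")

# Below 1.96: we cannot reject the 50% hypothesis at the 95% level.
print("Reject at 95%?", abs(test_statistic) > 1.96)
```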

Let us stick to the example of a 52% score in an electoral poll. We can build a confidence interval around that 52%: this will be [48,9% ; 55,1%].

Suppose we want to test whether the true score of the candidate is actually equal to 50% (and thus that the poll does not bring any information). By construction, under the hypothesis that the true score of the candidate is indeed 50%, the confidence interval has a 95% probability of covering the 50% value. This is the same as saying that, if the true score of the candidate is 50%, there is a 5% probability that the confidence interval does not include that value. Across all confidence intervals we could draw, the random sampling error will generate 5% of them which do not include the true value, even though the score of the candidate is indeed 50%.

This is the first risk of a statistical test: the confidence interval does not include the true value and we conclude, wrongly, that the true value should be rejected. By the way the test has been built, we know the probability of this happening: here, it is equal to 5%.

But there is also another risk of being wrong. Suppose the true value is not 50%, but that the confidence interval actually includes 50%. Then, we are going to conclude, wrongly, that the scores of the two candidates are equal.

What is the probability of that risk? The only thing that can be said, with the testing process we used, is that that probability is below 95%… Depending on the sample size and on the true value, the probability of concluding that the two candidates’ scores are equal, when they are not, can be as large as 50%, 60% or 70%. That second risk has no real upper bound (except if you consider 95% an actionable one…).

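The first risk can be checked by simulation. A sketch, assuming simple random sampling and the usual normal-approximation interval: with a true score of exactly 50%, about 5% of the 95% confidence intervals fail to cover it.

```python
import random, math

# Simulation of the first risk: the true score is exactly 50%, yet
# about 5% of the 95% confidence intervals fail to cover it.
random.seed(0)
n, n_polls, misses = 1000, 2000, 0

for _ in range(n_polls):
    p_hat = sum(random.random() < 0.50 for _ in range(n)) / n
    half = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    if not (p_hat - half <= 0.50 <= p_hat + half):
        misses += 1

print(f"share of intervals missing 50%: {misses / n_polls:.1%}")  # close to 5%
```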
**Level and Power**

In order to assess the quality of a test, two indicators can be calculated:

– The level, which is the probability of rejecting the hypothesis we want to test, when it is actually true (which we don’t want to do). We want to minimise the level.

– The power, which is the probability of rejecting the hypothesis we want to test, when it is indeed wrong (which we want to do). We want to maximise the power.

In the process we described earlier, we have an upper bound for the level, but no idea of the power. And it can be very low, possibly just above the level.

A basic result of statistics is that it is not possible to simultaneously minimise the level and maximise the power of a test: this is exactly the same kind of result as for bias and precision, which cannot be optimised simultaneously.

Hence the idea behind the aforementioned process: decide the level of the test (5% for example), and look for the test whose power is maximal across all tests with a 5% level.

**Size matters**

How can we increase the power of a test? There is only one solution: increase the sample size. On small samples, the power of tests will be small. As usual, when you have too few observations, statistics has little to tell.

The graphic below displays the power of the electoral poll test, as a function of the true value of the candidate’s score. By construction, if the true value of the candidate’s score is 50%, the power is equal to the level (5%), whatever the size of the sample. The larger the sample size and the further the true value from the tested one (50%), the higher the power.

Suppose we have 1000 respondents. Then,

– If the true scores of the candidates are both 50% (i.e., if the election is not decided yet), the probability of being wrong, and of declaring that the election is decided, with the process described earlier, is 5%.

– If the true score of one candidate is 51,5% (and thus the election is decided), the probability of being wrong is 63%!
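The two bullets above can be sketched as a power calculation, using the normal approximation for a single proportion tested against 50%. The variance assumption here is a simplification, so the figures will not exactly reproduce the 63% quoted above, but the qualitative behaviour (power equal to the level at 50%, rising with sample size and with the distance to 50%) is the same:

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power(p_true: float, n: int) -> float:
    """Probability of rejecting 'score = 50%' at the 5% level when the
    true score is p_true (two-sided test, normal approximation)."""
    se0 = math.sqrt(0.25 / n)                   # standard error under the 50% hypothesis
    se1 = math.sqrt(p_true * (1 - p_true) / n)  # standard error under the true score
    upper = (0.50 + 1.96 * se0 - p_true) / se1
    lower = (0.50 - 1.96 * se0 - p_true) / se1
    return (1 - norm_cdf(upper)) + norm_cdf(lower)

# At the tested value itself, the power equals the level (5%),
# whatever the sample size; it rises with n and with the distance to 50%.
print(round(power(0.50, 1000), 3))    # 0.05
print(round(power(0.515, 1000), 3))   # low power on 1000 respondents
print(round(power(0.515, 10000), 3))  # much higher power on 10000
```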

As mentioned before, we think that electoral polls are actually more precise than what a calculation based on a random sampling assumption would suggest. The power of the test is thus probably higher. But this reinforces the intuition that cumulating figures from various research companies leads to more robust conclusions. Having several firms measuring the same figure is a statistical blessing…

Let us transpose this to the assessment of a new medicine. We want to test whether the medicine is efficient. Let us fictitiously keep the same figures:

– If the medicine is indeed efficient, the probability of wrongly saying that it is not is 5%.

– If the medicine is inefficient, the probability of wrongly saying it is efficient is 63%…

We do not pretend that these figures apply to all medicine testing processes. But they should at least lead us to consider clinical trials on small samples with caution.

**False positive and false negative**

The testing process we just described is based on a dissymmetry between the two hypotheses being tested. Let us come back to our two examples:

– We test whether the variation in French households’ confidence in March 2014 is significantly different from 0. This boils down to testing two hypotheses one against the other: variation equal to 0 vs variation not equal to 0. In statistical jargon, the first one is the null hypothesis and the second one the alternative.

– We test whether the scores of two candidates in an election run-off are significantly different. This is also testing two hypotheses against each other: equal scores (and thus both equal to 50%) against different scores. Again, the null hypothesis against the alternative.

The dissymmetry between the two hypotheses comes from the way the risk of being mistaken is controlled, depending on which one is true:

– The level of the test is controlled, i.e. the probability of being wrong if the null hypothesis is true (thus, the probability of saying that confidence increases when it does not, or the probability of saying there is a winner when the scores are actually not statistically different). We can choose the threshold for the level as we wish: 5%, 1%, 0,1%…

– We do not control the power of the test, and thus the probability of being wrong if the alternative is true: if households’ confidence does indeed increase, our process will not allow us to control the probability of stating it is stable; if the scores of the two candidates are different, we do not control the probability of stating they are equal. And we have seen that, even with a sample of 1000 respondents, that probability can be significantly higher than the level.

A false positive arises when the null hypothesis is wrongly rejected. A false negative arises when the alternative is wrongly rejected. The testing process allows us to control the number of false positives, but not the number of false negatives. These false negatives are the real enemies of the statistician in the data-driven decision process.

The choice of the null hypothesis is thus not innocent. The pharma company will prefer the null hypothesis “the medicine is efficient”. The patient will probably prefer the reverse…

We have shown in the article on confidence intervals that the shortest confidence interval is the symmetric one. This is a clear rationale for the choice of a symmetric interval.

There is another, purely statistical, rationale for that choice: choosing a symmetric confidence interval ensures that the power is always above the level, which is not asking much… We then have what is called an unbiased test.

As for confidence intervals, the outcome of a test will depend on three parameters:

– The intrinsic dispersion of the data (the variance),

– The sample size,

– The confidence level for the test.

The universal habit is to use a 95% confidence level for the interval and thus a 5% level for the test: if the hypothesis that is tested is true, then there is a 5% chance of mistakenly saying it should be rejected, i.e. a 5% chance of drawing a sample which will lead to that conclusion.

This 5% threshold is arbitrary and is not theoretically grounded in any way. It is then quite tempting for the practicing statistician to choose the level of the test according to the story they want to tell. Any data processing software will output this type of table:

| | Shop 1 (A) | Shop 2 (B) | Shop 3 (C) | Shop 4 (D) |
|---|---|---|---|---|
| Satisfaction towards the shop | 82% BCd | 65% AcD | 71% Abd | 76% aBc |

This table allows the reader to test whether satisfaction with one shop is significantly different from satisfaction with another shop: satisfaction with shop 1 is significantly above satisfaction with shops 2 and 3 at the 95% level, and significantly above satisfaction with shop 4 at the 90% level. This is materialised by upper case letters (for the 95% confidence level) and lower case ones (for the 90% confidence level), which indicate the columns tested against. Compactness is the advantage of this kind of visualisation: one can immediately see the test results for different levels. But it also has a drawback: it is easy, and tempting, not to use the same level for tests dealing with the same dataset.

In the above example, shops 2 and 3 (B and C) are significantly below shop 1 at both 95% and 90%. But depending on the choice of the level, the story can be that shops 1 and 4 (A and D) are similar, or different.

Our recommendation: choose only one level for a given dataset and do not try to present test outcomes with different levels on the same dataset. This can only confuse your audience. But it is fully rational to use different levels for different sample sizes; the next sections of this article explain why. As a rule of thumb: use a 90% level when you have fewer than 300 respondents, a 99% level when you have more than 2000, and 95% in between.
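This rule of thumb is simple enough to be written down directly (the thresholds are the ones suggested above, nothing more):

```python
def recommended_level(n_respondents: int) -> float:
    """Rule-of-thumb confidence level by sample size, as suggested above."""
    if n_respondents < 300:
        return 0.90
    if n_respondents > 2000:
        return 0.99
    return 0.95

print(recommended_level(250))   # 0.9
print(recommended_level(1000))  # 0.95
print(recommended_level(5000))  # 0.99
```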

**p-value**

The p-value tells you the level of the test you should choose in order not to reject the hypothesis you are testing (p is for probability: a p-value is a probability). This indicator is a function of the distance between the data you have and the data you would need to reject the tested hypothesis. It can be directly calculated from the test statistic: computing the p-value is equivalent to computing the test statistic.

Suppose we want to test that the scores of two candidates in an election run-off are equal, as in section 1. The table below displays the p-value for several values of the test statistic, with the level of the test at which you would accept that the two scores are equal.

As long as the p-value is above 5%, you cannot reject the hypothesis that the two scores are equal at 5% level.

| Test Statistic | p-value | Comment |
|---|---|---|
| 0,50 | 0,6171 | Accept at 5%. Reject at 62%… |
| 1,95 | 0,0512 | Accept at 5%. Reject at 6%. |
| 2,50 | 0,0124 | Reject at 5%. Accept at 1%. |
| 6,00 | <0,0001 | Reject at 5% and at 1%. |
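The table can be reproduced with the two-sided normal approximation, which converts a test statistic into a p-value:

```python
import math

def p_value(test_statistic: float) -> float:
    """Two-sided p-value under the normal approximation."""
    phi = 0.5 * (1.0 + math.erf(abs(test_statistic) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

# Reproduces the rows of the table above.
for t in (0.50, 1.95, 2.50, 6.00):
    print(f"{t:4.2f} -> p = {p_value(t):.4f}")
```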

Statisticians may also encounter issues at the opposite end of the weak power mentioned in section 2.

Suppose you are engaged in multivariate modelling, for example a linear regression. In that type of modelling, one often tests whether explanatory variables are statistically significant (in explaining the dependent variable), or whether two explanatory variables have the same impact (on the dependent variable). The latter boils down to testing that two coefficients in the regression are equal. The statistical test will tell us whether the two coefficients are exactly equal, when quite often the empirical statistician wants to know whether they can be considered approximately equal.

Let us look, for example, at non-voting behaviour in the first round of the French parliamentary elections of June 2012. The larger the town, the fewer people vote. The graphic below displays the percentage of non-voters according to the number of people registered to vote in the town (in deciles): the more the former, the more the latter. These figures are calculated on the total population: there is no random sampling error. For example, there is indeed a small difference in the non-voter rates between decile 7 (37,5%) and decile 8 (38,0%). Empirically, this difference is immaterial if you compare it to the differences across the other town segments. In a model where you would use the non-voter rate as an explanatory variable, you would probably consider the rate to be the same across those two segments, just to have a more parsimonious model.

If we did not have access to the reference figures, we would perform a statistical test, using a sample of towns. Here are the results of various tests, according to the sample size (remember the sampling units are the towns):

| Number of towns | Test Statistic | p-value |
|---|---|---|
| 1100 | 0,58 | 0,45 |
| 2150 | 1,34 | 0,25 |
| 11000 | 3,32 | 0,07 |
| 15000 | 6,14 | 0,01 |
| 33000 | 8,07 | 0,00 |

As long as the test is done with 11000 towns or fewer, the hypothesis that the non-voter rates are equal across the two town segments cannot be rejected at the 5% level. Then, as the sample size increases further, the hypothesis is rejected.

This table is just another illustration of the graphic in section 2 about test power. The hypothesis we want to test is not exactly true, but approximately true (which is what matters in practice). As the sample size increases, the test power goes to 1 as soon as the true value is even slightly different from the hypothesis being tested.

Any data analyst has encountered the issue: with large sample sizes, it is almost impossible to accept any test hypothesis, which is detrimental to model parsimony and ease of communication.

The test theory we talked about in this article is widely used because it is easy to implement and to understand. It has however two weaknesses: low power on small samples, excessive power on large samples. Both originate from the same source: the dissymmetry between the two tested hypotheses.

**Bayesian tests**

The Bayesian theory of tests allows one to avoid that dissymmetry.

Bayesian estimation starts from the recognition that we generally have an a priori idea about the model we want to estimate and its parameters. Suppose for example we want to estimate a model forecasting soccer games. We want to estimate the probability of a given team winning (p). The model will take into account that this probability might not be the same when the team is playing at home or away: the probability of winning away is p and the probability of winning at home is p+δ. δ is a priori positive, maybe equal to 0, probably not negative. We will model that by assuming that δ has a probability distribution with strictly positive mean, and some probability of being equal to 0 (mathematical details are given at the end of this article). This probability distribution is stated a priori, before any number crunching.

Bayesian estimation looks at how this a priori is modified once the data have been analysed. The a priori probability distribution is confronted with the data, and we infer from that confrontation a new probability distribution for δ, called the a posteriori distribution. Bayesian estimation is named after an 18th century English mathematician, Reverend Bayes, who derived the Bayes formula, which allows one to calculate the probability of an event conditional on another event. In our case, the a posteriori law of δ is the law of δ conditional on the data, and it can be calculated from the a priori law and from the probability distribution of the data conditional on δ. Put more simply, the a priori law of δ is updated after analysing the data, as summarised below:

| Before data collection | Data collection | Data analysis |
|---|---|---|
| A priori on δ | The data follow a law conditional on δ | Update of the a priori on δ, conditional on the observed data |

Suppose we want to test that δ is equal to 0. A Bayesian test consists in comparing the probability that δ is equal to 0, conditional on the data, with the probability that δ is different from 0, conditional on the data. If the former is larger than the latter, we accept the hypothesis that δ is equal to 0. More elaborate versions of Bayesian tests give different weights to the different hypotheses, according to the risk of being wrong, but this does not change the reasoning.

What would be the difference between a test based on confidence intervals and a Bayesian test? Both are based on the difference between the probabilities of winning at home and away. As soon as this difference is too large, the hypothesis that δ is equal to 0 is rejected. But Bayesian tests bring with them a corrective coefficient which increases with sample size: where the test based on the 95% confidence interval would reject the hypothesis δ=0, the Bayesian test may still accept it, which is a way out of the issue mentioned in section 5.

As an exercise: we are interested in two new products, A and B, each tested with 400 respondents. We thus have 800 interviews, 400 for product A and 400 for product B. The purchase intent is 30% for A and 35% for B. Are those purchase intents significantly different, thus giving an edge to product B?
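A sketch of how the method of section 1 applies to this exercise, under the usual independent-samples approximation:

```python
import math

# Applying the method of section 1 to the two-product exercise:
# difference of purchase intents and its test statistic, under the
# usual independent-samples approximation.
n_a = n_b = 400
p_a, p_b = 0.30, 0.35

diff = p_b - p_a
se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
t = diff / se
print(f"difference = {diff:.0%}, test statistic = {t:.2f}")

# t is below 1.96: at the 95% level, the 5-point edge of product B
# cannot be distinguished from sampling noise on 400 + 400 interviews.
print("Significant at 95%?", abs(t) > 1.96)
```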


**Mathematical appendix**

The a priori probability distribution of δ is:

– δ = 0 with probability P

– δ follows a Gaussian law with mean m and variance 1, with probability 1-P.

We will assume there are as many games at home as away. δ can be estimated as the percentage of games won at home minus the percentage of games won away. This difference will be noted Z.

The law of Z conditional on δ is a Gaussian law with mean δ and variance σ/N, where N is the number of games at home. This approximation is only valid if N is large.

We thus have to calculate the probability distribution of δ conditional on Z.

The probability that δ is equal to 0, conditional on Z, is:

P * sqrt(N/(2πσ)) * exp(-N/(2σ) * Z²) / l(Z)

where l(Z) is the unconditional law of Z.

The probability of δ being different from 0, conditional on Z, can be calculated by integrating over δ the product of the law of δ with the law of Z conditional on δ. This can be written:

(1 - P)/sqrt(2π) * sqrt((N/σ)/(N/σ+1)) * exp(-N/(2σ) * (Z-m)²/(N/σ+1)) / l(Z)

The test based on confidence intervals will reject the hypothesis δ = 0 if |Z| is larger than c/sqrt(N), where c depends on the level of the test and on the law of Z.

The Bayesian test will reject δ = 0 if the probability that δ is equal to 0, conditional on Z, is smaller than the probability that δ is different from 0, conditional on Z. After calculation, this means that the hypothesis δ = 0 will be rejected if |Z| is larger than a threshold which behaves like sqrt(ln(N)/N), and thus decreases less quickly (as a function of the sample size) than the threshold of the test based on confidence intervals.
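The two conditional probabilities above can be implemented directly; since l(Z) appears in both, it cancels from the comparison. A sketch with arbitrary illustrative values for σ, P and m, showing the corrective effect: for a Z sitting exactly on the classical 5% rejection boundary, the posterior probability that δ = 0 grows with N, so the Bayesian test keeps accepting δ = 0 where the classical test rejects.

```python
import math

def posterior_prob_zero(Z: float, N: int, sigma: float = 1.0,
                        P: float = 0.5, m: float = 1.0) -> float:
    """P(delta = 0 | Z) from the two formulas above; l(Z) cancels out.
    sigma, P and m are arbitrary illustrative choices."""
    # numerator of P(delta = 0 | Z): point-mass component
    f0 = math.sqrt(N / (2 * math.pi * sigma)) * math.exp(-N / (2 * sigma) * Z**2)
    # numerator of P(delta != 0 | Z): Gaussian-prior component
    r = N / sigma
    f1 = (1 / math.sqrt(2 * math.pi)) * math.sqrt(r / (r + 1)) \
         * math.exp(-N / (2 * sigma) * (Z - m)**2 / (r + 1))
    return P * f0 / (P * f0 + (1 - P) * f1)

# Take Z exactly on the classical 5% rejection boundary, 1.96*sqrt(sigma/N):
# the classical test rejects delta = 0, but the posterior probability of
# delta = 0 grows with N.
results = {}
for N in (100, 1000, 10000):
    Z = 1.96 * math.sqrt(1.0 / N)
    results[N] = posterior_prob_zero(Z, N)
    print(N, round(results[N], 3))
```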