The next IREP Médias conference will deal with the new perceptions of timescales (new temporalities). From a statistical point of view, analysing temporal data often raises more complex issues than analysing individual data at a given time t. The issues actually start with sampling, and few sampling specialists are aware of this.

To understand this, here is a small exercise for our readers who are regular Tube users. Suppose trains leave their terminal every 5 minutes. The average interval between two trains will be 5’. What, then, is the average waiting time for a passenger arriving at a given station? The answer is in the next section: give it a few seconds' thought before reading on.

A typical answer to that question is the following: there is no reason why the train service provider should have something against me specifically. So I will sometimes arrive just after a train has left, sometimes just before the next one pulls into the station, and on the whole uniformly between the two. The waiting time will therefore be half the average interval between two trains, that is 2’30”. Tempting, but wrong reasoning.

It can be shown that the average waiting time is actually 5’ (here is the proof with the underlying assumptions). The reason is simply that the probability of arriving during a long interval is higher than the probability of arriving during a short one, which the previous reasoning overlooked.
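A quick simulation makes the point tangible. The sketch below is only an illustration under an assumption that is ours, not necessarily the one used in the linked proof: intervals between trains are independent and exponentially distributed with a mean of 5 minutes, and passengers arrive uniformly at random. The names `mean_wait` and `draw_interval` are hypothetical helpers introduced for this example.

```python
import bisect
import random


def mean_wait(draw_interval, n_trains=200_000, n_passengers=100_000, seed=0):
    """Average wait of passengers arriving uniformly at random.

    draw_interval(rng) returns the time (in minutes) between two trains.
    """
    rng = random.Random(seed)

    # Build the timetable as seen from the station: cumulative departure times.
    departures = []
    t = 0.0
    for _ in range(n_trains):
        t += draw_interval(rng)
        departures.append(t)
    horizon = departures[-1]

    # Each passenger waits until the first departure at or after arrival.
    total = 0.0
    for _ in range(n_passengers):
        arrival = rng.uniform(0.0, horizon)
        next_train = departures[bisect.bisect_left(departures, arrival)]
        total += next_train - arrival
    return total / n_passengers


if __name__ == "__main__":
    # Perfectly regular service: every interval is exactly 5 minutes.
    print("regular service  :", round(mean_wait(lambda rng: 5.0), 2))
    # Assumed irregular service: exponential intervals with a 5-minute mean.
    print("irregular service:", round(mean_wait(lambda rng: rng.expovariate(1 / 5.0)), 2))
```

With perfectly regular trains the intuitive answer holds and the simulated wait should come out close to 2.5 minutes. With the assumed exponential intervals, the average interval is still 5 minutes but the simulated wait should land close to 5 minutes.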

In terms of sampling, if we were to draw a sample of the intervals that cover the time of your arrival in the station, that sample would be made of intervals that are on average longer than the average interval between two trains.
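For readers who prefer a formula, the standard length-biased sampling result can be written as follows. This is a sketch under assumptions that are ours: intervals between trains are independent draws from the same distribution, and the passenger arrives at a time that is uniform and independent of the timetable. The symbols X, X*, W, f, μ and σ are notation introduced here.

```latex
% X   : interval between two consecutive trains, with density f,
%       mean \mu and variance \sigma^2
% X^* : the interval that happens to cover the passenger's arrival
% W   : the passenger's waiting time
\begin{align*}
f^{*}(x) &= \frac{x\, f(x)}{\mu}, \\
\mathbb{E}[X^{*}] &= \frac{\mathbb{E}[X^{2}]}{\mu}
                   = \mu + \frac{\sigma^{2}}{\mu} \;\ge\; \mu, \\
\mathbb{E}[W] &= \frac{\mathbb{E}[X^{*}]}{2}
               = \frac{\mu}{2} + \frac{\sigma^{2}}{2\mu}.
\end{align*}
```

With μ = 5’ and perfectly regular trains (σ = 0), the wait is 2’30”. With exponentially distributed intervals (σ = μ), the covering interval lasts 10’ on average and the wait is 5’.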

This type of endogenous sampling (or stock sampling) is quite well known. As a young researcher at the Département de la recherche of the French statistical office, I was working on the drivers of unemployment duration (see here or here). A phenomenon that was already well documented at the time is that a sample of unemployed people extracted at time t from the lists of the National Job Agency is biased: longer durations are overrepresented.
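The same mechanism can be reproduced with a toy simulation of stock sampling. Everything in it is an assumption made for illustration, not the author's unemployment data or model: spells start uniformly over a long window, their total durations are exponential with a 6-month mean, and the "survey" keeps only spells in progress at a single date. The function name `simulate_stock_sample` and its parameters are hypothetical.

```python
import random


def simulate_stock_sample(mean_duration=6.0, n_spells=400_000,
                          window=1_000.0, seed=1):
    """Compare all spells (the flow) with spells in progress at one date (the stock).

    Returns (mean duration over all spells, mean duration in the stock sample).
    """
    rng = random.Random(seed)
    survey_date = window / 2  # far from both edges of the window

    all_durations = []
    in_stock = []
    for _ in range(n_spells):
        start = rng.uniform(0.0, window)
        duration = rng.expovariate(1.0 / mean_duration)  # assumed distribution
        all_durations.append(duration)
        # A spell is on the lists at the survey date only if it is still running.
        if start <= survey_date < start + duration:
            in_stock.append(duration)

    flow_mean = sum(all_durations) / len(all_durations)
    stock_mean = sum(in_stock) / len(in_stock)
    return flow_mean, stock_mean


if __name__ == "__main__":
    flow_mean, stock_mean = simulate_stock_sample()
    print("mean duration, all spells       :", round(flow_mean, 2))
    print("mean duration, stock at date t  :", round(stock_mean, 2))
```

Under these assumptions, the spells caught by the survey date should last roughly twice as long on average as spells in general, simply because a long spell has more chances of being in progress on any given day.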

This is exactly what happens with the audience measurement of print media. A standard audience indicator is the Average Issue Readership (AIR): reading a publication within its publication interval, that is in the last day, the last week or the last month. AIR is calculated by asking the interviewee about his/her last reading date. This is precisely measuring a duration: the duration since the last time the interviewee read the publication. Like any duration sampled at time t, its measurement is biased because of endogenous sampling. Longer durations are oversampled: the duration since the last reading date, as measured in the sample of a classic readership survey, will be larger than the average duration in the reference population.

The impact on AIR is not straightforward, as it depends on the way the probability of first reading evolves over time. In a simple model where this probability increases and then decreases, with a 12-day average (every other issue of a weekly magazine is read), AIR is underestimated by 10%.

It is quite paradoxical to think about the care with which these surveys are sampled, when the questioning mode biases the results as soon as the respondent opens his/her mouth (or clicks on his/her screen). This is one more proof that the future of research is not in the sophistication of sampling methods, but in expert analytics, which can handle the bias in data collection.

Antoine Moreau

24/11/2014