An interval can hide another

A representative sample of respondents/consumers/clients does not automatically translate into unbiased statistics. As soon as you are interested in durations (visit duration, readership duration, waiting duration), for example, you need to be careful. This is also where you will learn why waiting for the next train always seems too long…

When is the next train?

It is now standard for the waiting time before the next train or bus to be posted in stations. This allows transportation providers to manage clients’ expectations and impatience. The question above thus becomes irrelevant. Still, here is a small exercise for our readers.

Suppose trains or buses go out every 5 minutes from the terminal. The average interval between two trains or buses will be 5’. What is then the average waiting time for a client arriving in a given station? The answer is in the next section: just give it a thought for a few seconds before going on.

A typical answer to that question is the following: there is no reason why the buses/trains should have something against me specifically. Thus, I will arrive sometimes just after the departure of a train, sometimes just before its arrival in the station, and mostly uniformly between the two. The waiting time thus will be half the average interval between two trains, thus 2’30”. Tempting, but wrong reasoning.

The graphic below shows when trains (or buses) stop at the nearest station:

[Figure: time line of train stops at 0, T₁, T₂, T₃, … with the passenger's arrival time t]

After a first stop at 0, trains will stop at T₁, T₂, T₃, … And you will arrive at time t: t will be somewhere uniformly random on the red line. By nature, t is more likely to fall within a long duration (for example, T₂-T₁, T₆-T₅, T₈-T₇) than within a short duration (for example, T₄-T₃). In terms of sampling, if we were to draw a sample of the durations that cover the time of your arrival in the station, that sample would be made of durations that are on average longer than the average duration between two trains. This is because the probability of arriving during a long duration is higher than the probability of arriving during a short duration, which the previous reasoning overlooked.
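The oversampling of long durations can be checked with a small simulation. The sketch below is only an illustration: the uniform distribution of the gaps between trains (between 1 and 9 minutes, mean 5), the function name `covering_interval_means` and all numeric parameters are assumptions, not part of the original text.

```python
import bisect
import random
from itertools import accumulate


def covering_interval_means(n_intervals=200_000, n_arrivals=50_000, seed=0):
    """Return (mean gap between trains, mean gap covering a passenger's arrival)."""
    rng = random.Random(seed)
    # Assumption for illustration: gaps are uniform on [1, 9] minutes (mean 5).
    gaps = [rng.uniform(1.0, 9.0) for _ in range(n_intervals)]
    # Stop times T1, T2, ... (the first stop is at time 0).
    stops = list(accumulate(gaps))
    horizon = stops[-1]

    covering = []
    for _ in range(n_arrivals):
        # The passenger arrives uniformly at random on the time line.
        t = rng.uniform(0.0, horizon)
        # Index of the first stop strictly after t; gap i ends at stops[i].
        i = min(bisect.bisect_right(stops, t), n_intervals - 1)
        covering.append(gaps[i])

    return sum(gaps) / len(gaps), sum(covering) / len(covering)


if __name__ == "__main__":
    mean_gap, mean_covering = covering_interval_means()
    print(f"mean gap between trains:     {mean_gap:.2f} min")
    print(f"mean gap covering an arrival: {mean_covering:.2f} min")
```

Under these assumptions, the second mean should come out clearly above the first: the gaps "seen" by passengers are not a representative sample of all gaps.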

It can be shown that, if the arrival of trains in a station follows a Poisson process with a mean interval m between two trains, the average waiting time will be precisely m. The answer to the question, in that specific framework, would then be 5’, far from the 2’30”.
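The two answers can be reconciled with a standard result from renewal theory. It is not stated in the original text and is quoted here only as a sketch, assuming that the intervals Y between trains are independent with the same distribution and that the passenger arrives at a uniformly random time. The symbols W (waiting time) and σ² (variance of Y) are introduced for this illustration.

```latex
\mathbb{E}[W] \;=\; \frac{\mathbb{E}[Y^2]}{2\,\mathbb{E}[Y]}
          \;=\; \frac{m}{2}\left(1 + \frac{\sigma^2}{m^2}\right),
\qquad m = \mathbb{E}[Y], \quad \sigma^2 = \operatorname{Var}(Y).
```

With perfectly regular trains, σ = 0 and the average wait is m/2, the tempting 2’30”. With exponential intervals, σ = m and the average wait is m, that is 5’. The more irregular the intervals, the longer the average wait.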

That type of sampling is called endogenous sampling. The measurement of the variable of interest is, by nature, biased.

A few classic examples of endogenous sampling

A classic example of an endogenous sample can be found in studies on unemployment. A sample of unemployed people, drawn at time t from the lists of the public job agency, would be a biased sample, with longer than average unemployment durations: long unemployment durations have a larger probability of including time t than short ones. One way to obtain a representative sample of the unemployed would be, for example, to include in the sample all people registering at the job agency during some time period.
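The difference between the two sampling schemes can be sketched with simulated unemployment spells. Everything below is hypothetical: the exponential spell durations (mean 6 months), the registration window, the survey date and the function name `stock_vs_flow` are assumptions chosen for illustration only.

```python
import random


def stock_vs_flow(n_spells=200_000, mean_duration=6.0, window=120.0,
                  survey_date=100.0, seed=1):
    """Return (mean duration in the flow sample, mean duration in the stock sample)."""
    rng = random.Random(seed)
    flow, stock = [], []
    for _ in range(n_spells):
        # Registration date, uniform over the window (in months).
        start = rng.uniform(0.0, window)
        # Assumption: spell durations are exponential with the given mean.
        duration = rng.expovariate(1.0 / mean_duration)
        # Flow sample: everyone who registers during the window.
        flow.append(duration)
        # Stock sample: people on the agency's lists at the survey date.
        if start <= survey_date < start + duration:
            stock.append(duration)
    return sum(flow) / len(flow), sum(stock) / len(stock)


if __name__ == "__main__":
    mean_flow, mean_stock = stock_vs_flow()
    print(f"flow sample  (all registrations): {mean_flow:.2f} months")
    print(f"stock sample (on lists at t):     {mean_stock:.2f} months")
```

Under these assumptions, the stock sample shows spells roughly twice as long on average as the flow sample, even though both are drawn from the same population of spells.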

Another interesting example comes from the readership measurement of dailies or magazines. A standard audience indicator is the Average Issue Readership (AIR): reading a publication within its publication interval, in the last day, the last week or the last month. AIR is calculated by asking the interviewee about his/her last reading date. This is precisely measuring a duration: the duration since the last time the interviewee read the publication. As any duration which is sampled at time t, its measurement is biased because of endogenous sampling. Longer durations are oversampled: the duration since the last reading date, as measured in the sample of a classic readership survey, will be larger than the average duration in the reference population.

For example, assume the duration between two reading times follows an exponential law (the equivalent of a Poisson process for the times of reading). Then the measured duration will be on average equal to the average duration between two readings. Thus larger than the duration since the last reading date: the interviewee will in most cases not read the publication again on the date of the interview.

The impact on AIR is not straightforward. You could think that, the duration since the last reading date being overestimated, AIR will be underestimated. This actually depends on how the probability of reading again is evolving over time.

Your turn

Only one of the sampling schemes below is not endogenous. Which one?


Related Item

Bias

Endogeneity

References

D.R. Cox and D. Oakes (1984): Analysis of Survival Data. Chapman and Hall.

J.J. Heckman and B. Singer (1984): Econometric duration analysis. Journal of Econometrics 24, 63–132.

T. Lancaster (1990): The Econometric Analysis of Transition Data. Cambridge University Press.

Proof waiting time

[Figure: notations used in the proof]

The graphic above gives the notations used. $Y_i$ is the duration between two consecutive trains, which is assumed to follow an exponential law with mean $M$. We want to find the distribution of $T_{N_t+1} - t$ and its mean. Let $N_t$ be the number of trains that stopped in the station before $t$: $N_t$ follows a Poisson law with parameter $t/M$, and $N_{t+u} - N_t$ a Poisson law with parameter $u/M$. We would like to calculate the probability that $T_{N_t+1} - t$ is larger than $u$. We have:

$T_{N_t+1} - t > u$ is equivalent to $N_{t+u} - N_t = 0$

Thus, the probability that $T_{N_t+1} - t > u$ is $\exp(-u/M)$, and the expectation of $T_{N_t+1} - t$ is equal to $M$. In much the same way, $t - T_{N_t}$ is distributed as an exponential law with mean $M$, and the sum of the two, which are independent, is a gamma law $(2, 1/M)$. Thus, the expectation of $T_{N_t+1} - T_{N_t}$ is equal to $2M$.
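The result of the proof can be checked numerically. The sketch below simulates many independent Poisson processes starting with a stop at 0, and measures, at a fixed time t, the wait until the next train and the time since the previous one. The function name `recurrence_times` and the numeric values (mean gap of 5, t = 100, number of runs) are assumptions made for this illustration.

```python
import random


def recurrence_times(mean_gap=5.0, t=100.0, n_runs=20_000, seed=2):
    """Return the average (wait until next train, time since last train, their sum) at time t."""
    rng = random.Random(seed)
    forward_total = 0.0
    backward_total = 0.0
    for _ in range(n_runs):
        # First stop at time 0, then exponential gaps with the given mean.
        last_stop = 0.0
        next_stop = rng.expovariate(1.0 / mean_gap)
        while next_stop <= t:
            last_stop = next_stop
            next_stop += rng.expovariate(1.0 / mean_gap)
        forward_total += next_stop - t    # wait until the next train
        backward_total += t - last_stop   # time since the previous train
    forward = forward_total / n_runs
    backward = backward_total / n_runs
    return forward, backward, forward + backward


if __name__ == "__main__":
    forward, backward, covering = recurrence_times()
    print(f"average wait until next train:  {forward:.2f}")
    print(f"average time since last train:  {backward:.2f}")
    print(f"average length of the interval: {covering:.2f}")
```

With t large compared to the mean gap, the first two averages should both be close to the mean gap, and the interval covering t should be close to twice the mean gap, in line with the proof.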

Statpedia