A representative sample of respondents/consumers/clients doesn’t automatically translate into unbiased statistics. As soon as you are interested in durations (visit duration, readership duration, waiting duration), for example, you need to be careful. Where you will also learn why waiting for the next train always seems too long…

It is now standard for waiting times before the next train or bus to be posted in stations. This allows transportation providers to manage clients’ expectations and impatience. Thus, the above question becomes irrelevant. Still, a small exercise for our readers.

Suppose trains or buses go out every 5 minutes from the terminal. The average interval between two trains or buses will be 5’. What is then the average waiting time for a client arriving in a given station? The answer is in the next section: just give it a thought for a few seconds before going on.

A typical answer to that question is the following: there is no reason why the buses/trains should have something against me specifically. Thus, I will arrive sometimes just after the departure of a train, sometimes just before its arrival in the station, and mostly uniformly between the two. The waiting time thus will be half the average interval between two trains, thus 2’30”. Tempting, but wrong reasoning.

The graphic below shows when trains (or buses) stop at the nearest station: after a first stop at time 0, trains stop at T_{1}, T_{2}, T_{3}, … You arrive at time t, which falls uniformly at random somewhere on the red line. By nature, t is more likely to fall in a long interval (for example, T_{2}-T_{1}, T_{6}-T_{5}, T_{8}-T_{7}) than in a short one (for example, T_{4}-T_{3}). In terms of sampling, if we were to draw a sample of the intervals that cover the time of your arrival in the station, those intervals would be on average longer than the average interval between two trains, because the probability of arriving during a long interval is higher than the probability of arriving during a short one. This is what the previous reasoning overlooked.
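The oversampling of long intervals can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: gaps between trains are drawn from an exponential law with a mean of 5 minutes, and the function name `covering_interval_lengths` and its parameters are made up for this example.

```python
import bisect
import random


def covering_interval_lengths(mean_gap=5.0, n_trains=200_000,
                              n_passengers=50_000, seed=0):
    """Return (all gaps between trains, gaps that contain a passenger arrival)."""
    rng = random.Random(seed)
    # Assumption: gaps between consecutive trains are exponential with mean mean_gap.
    gaps = [rng.expovariate(1.0 / mean_gap) for _ in range(n_trains)]
    # Train times T_1 < T_2 < ... are the cumulated gaps (first stop at time 0).
    times = []
    clock = 0.0
    for gap in gaps:
        clock += gap
        times.append(clock)
    covering = []
    for _ in range(n_passengers):
        t = rng.uniform(0.0, times[-1])      # passenger arrives uniformly in time
        i = bisect.bisect_left(times, t)     # index of the first train at or after t
        covering.append(gaps[i])             # length of the interval containing t
    return gaps, covering


gaps, covering = covering_interval_lengths()
print("average gap between trains      :", sum(gaps) / len(gaps))
print("average gap covering an arrival :", sum(covering) / len(covering))
```

Under these assumptions the first average should come out close to 5 minutes and the second close to 10 minutes: the intervals you actually fall into are about twice as long as the typical one.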

It can be shown that, if the arrival of trains in a station follows a Poisson process with a mean interval m between two trains, the average waiting time is precisely m. In that specific framework, the answer to the question is then 5’, far from the 2’30”.
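To compare the two answers, here is a minimal simulation sketch. It contrasts a strictly regular timetable (one train exactly every 5 minutes) with Poisson arrivals having the same mean interval. The helper names `mean_wait` and `simulate` are hypothetical and only serve the illustration.

```python
import bisect
import random


def mean_wait(train_times, n_passengers, rng):
    """Average wait until the next train for passengers arriving uniformly in time."""
    horizon = train_times[-1]
    total = 0.0
    for _ in range(n_passengers):
        t = rng.uniform(0.0, horizon)
        next_train = train_times[bisect.bisect_left(train_times, t)]
        total += next_train - t
    return total / n_passengers


def simulate(mean_gap=5.0, n_trains=200_000, n_passengers=50_000, seed=1):
    """Return (mean wait with a regular timetable, mean wait with Poisson arrivals)."""
    rng = random.Random(seed)
    # Regular timetable: one train exactly every mean_gap minutes.
    regular = [mean_gap * (k + 1) for k in range(n_trains)]
    # Poisson arrivals: independent exponential gaps with the same mean.
    poisson = []
    clock = 0.0
    for _ in range(n_trains):
        clock += rng.expovariate(1.0 / mean_gap)
        poisson.append(clock)
    return (mean_wait(regular, n_passengers, rng),
            mean_wait(poisson, n_passengers, rng))


regular_wait, poisson_wait = simulate()
print("regular timetable:", regular_wait)
print("Poisson arrivals :", poisson_wait)
```

With a perfectly regular timetable the intuitive answer holds and the average wait is about half the interval. With Poisson arrivals, the simulated average wait should be close to the full mean interval.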

That type of sampling is called endogenous sampling. The measurement of the variable of interest is, by nature, biased.

A classic example of endogenous sampling can be found in studies on unemployment. A sample of unemployed people, drawn at time t from the lists of the public job agency, would be a biased sample, with longer-than-average unemployment durations: long unemployment durations have a larger probability of including time t than short ones. One way to obtain a representative sample of unemployed people would be, for example, to include in the sample all people registering at the job agency during some time period.
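The difference between the two sampling schemes can be illustrated on synthetic data. In the sketch below, everything is hypothetical: spells start uniformly over a long window, durations are exponential with a mean of 6 months, and the function `stock_vs_flow` and its parameters are invented for the example.

```python
import random


def stock_vs_flow(mean_spell=6.0, n_spells=200_000, window=1_000.0,
                  survey_date=500.0, seed=2):
    """Return (mean spell length in a flow sample, mean spell length in a stock sample)."""
    rng = random.Random(seed)
    # Hypothetical population: (start date, total duration) of each unemployment spell.
    spells = [(rng.uniform(0.0, window), rng.expovariate(1.0 / mean_spell))
              for _ in range(n_spells)]
    # Flow sample: everyone who registers during a given period, whatever the spell length.
    flow = [d for start, d in spells if 400.0 <= start < 600.0]
    # Stock sample: everyone on the lists at the survey date (spell in progress at that date).
    stock = [d for start, d in spells if start <= survey_date < start + d]
    return sum(flow) / len(flow), sum(stock) / len(stock)


flow_mean, stock_mean = stock_vs_flow()
print("flow sample, mean duration :", flow_mean)
print("stock sample, mean duration:", stock_mean)
```

Under these assumptions the flow sample should recover a mean duration close to 6 months, while the stock sample drawn at the survey date should show a mean duration roughly twice as long.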

Another interesting example comes from the readership measurement of dailies or magazines. A standard audience indicator is the Average Issue Readership (AIR): reading a publication within its publication interval, that is, in the last day, the last week or the last month. AIR is calculated by asking the interviewee about his/her last reading date. This is precisely measuring a duration: the duration since the last time the interviewee read the publication. Like any duration sampled at time t, its measurement is biased because of endogenous sampling. Longer durations are oversampled: the duration since the last reading date, as measured in the sample of a classic readership survey, will be larger than the average duration in the reference population.

For example, assume the duration between two reading times follows an exponential law (the equivalent of a Poisson process for the times of reading). Then the measured duration will on average be equal to the average duration between two readings. Thus larger than the duration since the last reading date: in most cases, the interviewee will not read the publication again on the date of the interview.

The impact on AIR is not straightforward. You could think that, the duration since the last reading date being overestimated, AIR will be underestimated. This actually depends on how the probability of reading again is evolving over time.




The graphic above gives the notations used. Y_{i} is the duration between two consecutive trains, which will be assumed to follow an exponential law with mean M. We want to find the distribution law of T_{N_t+1}-t, and its mean. Let N_{t} be the number of trains that stopped in the station before t: N_{t} is distributed as a Poisson law of parameter t/M, and N_{t+u}-N_{t} as a Poisson law of parameter u/M. We would like to calculate the probability that T_{N_t+1}-t is larger than u. We have:

T_{N_t+1}-t > u is equivalent to N_{t+u}-N_{t} = 0

Thus, the probability that T_{N_t+1}-t > u is exp(-u/M), and the expectation of T_{N_t+1}-t is equal to M. In much the same way, t-T_{N_t} is distributed as an exponential law of mean M, and the sum of the two is a gamma law (2, 1/M). Thus, the expectation of T_{N_t+1}-T_{N_t} is equal to 2M.
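The chain of equalities behind this result can be written out as a short worked derivation. It is a sketch using the same symbols (N_t for the number of trains before t, M for the mean interval). It relies on the standard identity that the mean of a nonnegative variable is the integral of its survival function, and it assumes t is large compared with M so that the truncation of t-T_{N_t} at the origin can be ignored.

```latex
\begin{align*}
P\left(T_{N_t+1} - t > u\right)
  &= P\left(N_{t+u} - N_t = 0\right) = e^{-u/M}, \\
E\left[T_{N_t+1} - t\right]
  &= \int_0^{\infty} P\left(T_{N_t+1} - t > u\right)\,du
   = \int_0^{\infty} e^{-u/M}\,du = M, \\
E\left[T_{N_t+1} - T_{N_t}\right]
  &= E\left[T_{N_t+1} - t\right] + E\left[t - T_{N_t}\right]
   = M + M = 2M.
\end{align*}
```

The last line is the oversampling of long intervals in numbers: the interval that contains your arrival time is on average twice as long as a typical interval between two trains.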