Sampling from a Population - Computational and Inferential Thinking

The law of averages also holds when the random sample is drawn from individuals in a large population.

As an example, we will study a population of flight delay times. The table united contains data for United Airlines domestic flights departing from San Francisco in the summer of 2015. The data are made publicly available by the Bureau of Transportation Statistics in the United States Department of Transportation.

There are 13,825 rows, each corresponding to a flight. The columns are the date of the flight, the flight number, the destination airport code, and the departure delay time in minutes. Some delay times are negative: those flights left early.

from datascience import *
path_data = '../../../assets/data/'
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import numpy as np

united = Table.read_table(path_data + 'united_summer2015.csv')
united

One flight departed 16 minutes early, and one was 580 minutes late. The other delay times were almost all between -10 minutes and 200 minutes, as the histogram below shows.

united.column('Delay').min()

-16

united.column('Delay').max()

580

delay_bins = np.append(np.arange(-20, 301, 10), 600)
united.hist('Delay', bins = delay_bins, unit = 'minute')

Histogram with 'Delay (minute)' on the x-axis and 'Percent per minute' on the y-axis. The tallest bars are on the left-hand side of the graph, close to 0. The height of the bars quickly decreases, but there is a long right tail that extends to 600.

For the purposes of this section, it is enough to zoom in on the bulk of the data and ignore the 0.8% of flights that had delays of more than 200 minutes. This restriction is just for visual convenience; the table still retains all the data.

united.where('Delay', are.above(200)).num_rows/united.num_rows

0.008390596745027125

delay_bins = np.arange(-20, 201, 10)
united.hist('Delay', bins = delay_bins, unit = 'minute')

Histogram with 'Delay (minute)' on the x-axis and 'Percent per minute' on the y-axis. The tallest bars are between -10 and 10, and there is a quick drop-off in the height of the bars after that. Bars are visible but very small until about x=140. The graph continues to extend to x=200.

The height of the [0, 10) bar is just under 3% per minute, which means that just under 30% of the flights had delays between 0 and 10 minutes. That is confirmed by counting rows:

united.where('Delay', are.between(0, 10)).num_rows/united.num_rows

0.2935985533453888
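The link between bar height and percent is the area principle: on the 'Percent per minute' scale, the percent of flights in a bin is the bar's height times the bin's width. A minimal sketch of that arithmetic, using the proportion reported above (the variable names are illustrative, not part of the original code):

```python
# Area principle for a histogram drawn on the density scale:
#   percent in bin = height (percent per minute) * width (minutes)
bin_width = 10                      # width of the [0, 10) bin, in minutes
proportion = 0.2935985533453888     # proportion of flights in [0, 10), from the row count above

percent_in_bin = 100 * proportion
height = percent_in_bin / bin_width  # percent per minute

print(round(percent_in_bin, 2))  # 29.36
print(round(height, 2))          # 2.94
```

A height of about 2.94% per minute over a 10-minute bin therefore accounts for about 29.4% of the flights, consistent with "just under 30%".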

10.2.1 Empirical Distribution of the Sample

Let us now think of the 13,825 flights as a population, and draw random samples from it with replacement. It is helpful to package our code into a function. The function empirical_hist_delay takes the sample size as its argument and draws an empirical histogram of the delays in the sample.

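def empirical_hist_delay(n):
    """Draw a histogram of the delays in a random sample of n flights."""
    # Table.sample draws with replacement by default
    united.sample(n).hist('Delay', bins = delay_bins, unit = 'minute')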

As we saw with the dice, as the sample size increases, the empirical histogram of the sample more closely resembles the histogram of the population. Compare these histograms to the population histogram above.

empirical_hist_delay(10)

Histogram with 'Delay (minute)' on the x-axis and 'Percent per minute' on the y-axis. The x-axis extends from about -10 to 200. There are three bars with non-zero height, from -10 to 0 with middle height, 0 to 10 with the tallest height, and 10 to 20 with the shortest height.

empirical_hist_delay(100)

Histogram with 'Delay (minute)' on the x-axis and 'Percent per minute' on the y-axis. The tallest bars are between -10 and 20. There are a number of short, non-zero height bars between 20 and 110.

The most consistently visible discrepancies are among the values that are rare in the population. In our example, those values are in the right-hand tail of the distribution. But as the sample size increases, even those values begin to appear in the sample in roughly the correct proportions.

empirical_hist_delay(1000)

Histogram with 'Delay (minute)' on the x-axis and 'Percent per minute' on the y-axis. The tallest bars are again between -10 and 20 with short, non-zero height bars extending to about 150.
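The visual comparison can also be made numerically by comparing the proportion of values in each bin for the sample and for the population. The sketch below is only an illustration under stated assumptions: it uses a synthetic right-skewed population as a stand-in for the flight delays (so that it runs without the data file), and the helper names bin_proportions and max_discrepancy are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in population (NOT the United data):
# 13,825 right-skewed values, shifted so that some are negative.
population = rng.exponential(scale=20, size=13825) - 10

bins = np.arange(-20, 201, 10)

def bin_proportions(values):
    """Proportion of all the values that fall in each bin."""
    counts, _ = np.histogram(values, bins=bins)
    return counts / len(values)

pop_props = bin_proportions(population)

def max_discrepancy(n):
    """Largest gap between sample and population bin proportions
    for one random sample of size n drawn with replacement."""
    sample = rng.choice(population, size=n, replace=True)
    return np.abs(bin_proportions(sample) - pop_props).max()

for n in [10, 100, 1000, 10000]:
    print(n, max_discrepancy(n))
```

The printed gaps vary from run to run when the seed is changed, but they tend to shrink as the sample size grows.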

10.2.2 Convergence of the Empirical Histogram of the Sample

What we have observed in this section can be summarized as follows:

For a large random sample, the empirical histogram of the sample resembles the histogram of the population, with high probability.

This justifies the use of large random samples in statistical inference. The idea is that since a large random sample is likely to resemble the population from which it is drawn, quantities computed from the values in the sample are likely to be close to the corresponding quantities in the population.
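As a small illustration of this idea, the sketch below compares the median of a large random sample with the median of the population it was drawn from. It again assumes a synthetic right-skewed population rather than the flight data, and the choice of the median and of a sample size of 1,000 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in population (NOT the United data)
population = rng.exponential(scale=20, size=13825) - 10
pop_median = np.median(population)

# One large random sample, drawn with replacement
sample = rng.choice(population, size=1000, replace=True)
sample_median = np.median(sample)

print('Population median:', pop_median)
print('Sample median:    ', sample_median)
```

With a sample this large, the two medians are likely, though not guaranteed, to be close.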
