13.3. Confidence Intervals#
We have developed a method for estimating a parameter by using random sampling and the bootstrap. Our method produces an interval of estimates, to account for chance variability in the random sample. By providing an interval of estimates instead of just one estimate, we give ourselves some wiggle room.
In the previous example we saw that our process of estimation produced a good interval about 95% of the time, a “good” interval being one that contains the parameter. We say that we are 95% confident that the process results in a good interval. Our interval of estimates is called a 95% confidence interval for the parameter, and 95% is called the confidence level of the interval.
The method is called the bootstrap percentile method because the interval is formed by picking off two percentiles of the bootstrapped estimates.
The situation in the previous example was a bit unusual. Because we happened to know the value of the parameter, we were able to check whether an interval was good or a dud, and this in turn helped us to see that our process of estimation captured the parameter about 95 out of every 100 times we used it.
But usually, data scientists don’t know the value of the parameter. That is the reason they want to estimate it in the first place. In such situations, they provide an interval of estimates for the unknown parameter by using methods like the one we have developed. Because of statistical theory and demonstrations like the one we have seen, data scientists can be confident that their process of generating the interval results in a good interval a known percent of the time.
13.3.1. Estimating a Population Median#
We will now use the bootstrap method to estimate an unknown population median. You have encountered the dataset before. It comes from a sample of newborns in a large hospital system. we will treat it as if it were a simple random sample though the sampling was done in multiple stages. Stat Labs by Deborah Nolan and Terry Speed has details about a larger dataset from which this set is drawn.
The table births
contains the following variables for mother-baby pairs: the baby’s birth weight in ounces, the number of gestational days (the number of days the mother was pregnant), the mother’s age in completed years, the mother’s height in inches, pregnancy weight in pounds, and whether or not the mother smoked during pregnancy.
births = Table.read_table(path_data + 'baby.csv')
births.show(3)
Birth Weight | Gestational Days | Maternal Age | Maternal Height | Maternal Pregnancy Weight | Maternal Smoker |
---|---|---|---|---|---|
120 | 284 | 27 | 62 | 100 | False |
113 | 282 | 33 | 64 | 135 | False |
128 | 279 | 28 | 64 | 115 | True |
... (1171 rows omitted)
Birth weight is an important factor in the health of a newborn infant. Smaller babies tend to need more medical care in their first days than larger newborns. It is therefore helpful to have an estimate of birth weight before the baby is born. One way to do this is to examine the relationship between birth weight and the number of gestational days.
A simple measure of this relationship is the ratio of birth weight to the number of gestational days. The table ratios
contains the first two columns of baby
, as well as a column of the ratios. The first entry in that column was calculated as follows:
ratios = births.select('Birth Weight', 'Gestational Days').with_columns(
'Ratio BW:GD', births.column('Birth Weight')/births.column('Gestational Days')
)
ratios
Birth Weight | Gestational Days | Ratio BW:GD |
---|---|---|
120 | 284 | 0.422535 |
113 | 282 | 0.400709 |
128 | 279 | 0.458781 |
108 | 282 | 0.382979 |
136 | 286 | 0.475524 |
138 | 244 | 0.565574 |
132 | 245 | 0.538776 |
120 | 289 | 0.415225 |
143 | 299 | 0.478261 |
140 | 351 | 0.39886 |
... (1164 rows omitted)
Here is a histogram of the ratios.
ratios.select('Ratio BW:GD').hist()
At first glance the histogram looks quite symmetric, with the density at its maximum over the interval 0.4 ounces per day to 0.45 ounces per day. But a closer look reveals that some of the ratios were quite large by comparison. The maximum value of the ratios was just over 0.78 ounces per day, almost double the typical value.
ratios.sort('Ratio BW:GD', descending=True).take(0)
Birth Weight | Gestational Days | Ratio BW:GD |
---|---|---|
116 | 148 | 0.783784 |
The median gives a sense of the typical ratio because it is unaffected by the very large or very small ratios. The median ratio in the sample is about 0.429 ounces per day.
percentile(50, ratios.column(2))
0.42907801418439717
But what was the median in the population? We don’t know, so we will estimate it.
Our method will be exactly the same as in the previous section. We will bootstrap the sample 5,000 times resulting in 5,000 estimates of the median. Our 95% confidence interval will be the “middle 95%” of all of our estimates.
13.3.1.1. Constructing a Bootstrap Confidence Interval#
We will start by defining a function one_bootstrap_median
. It will bootstrap the sample and return one the median ratio in the bootstrapped sample.
def one_bootstrap_median():
resample = ratios.sample()
return percentile(50, resample.column('Ratio BW:GD'))
Run the cell below to see how the bootstrapped ratios vary. Remember that each of them is an estimate of the unknown ratio in the population.
one_bootstrap_median()
0.43010752688172044
Now we can use a for
loop to generate 5000 bootstrapped medians.
# Generate medians from 5000 bootstrap samples
num_repetitions = 5000
bstrap_medians = make_array()
for i in np.arange(num_repetitions):
bstrap_medians = np.append(bstrap_medians, one_bootstrap_median())
# Get the endpoints of the 95% confidence interval
left = percentile(2.5, bstrap_medians)
right = percentile(97.5, bstrap_medians)
make_array(left, right)
array([0.42545455, 0.43272727])
The 95% confidence interval goes from about 0.425 ounces per day to about 0.433 ounces per day. We are estimating the median “birth weight to gestational days” ratio in the population is somewhere in the interval 0.425 ounces per day to 0.433 ounces per day.
The estimate of 0.429 based on the original sample happens to be half-way in between the two ends of the interval, though that need not be true in general.
To visualize our results, let us draw the empirical histogram of our bootstrapped medians and place the confidence interval on the horizontal axis.
resampled_medians = Table().with_columns(
'Bootstrap Sample Median', bstrap_medians
)
resampled_medians.hist(bins=15)
plots.plot([left, right], [0, 0], color='yellow', lw=8);
This histogram and interval resembles those we drew in the previous section, with one big difference – there is no green dot showing where the parameter is. We don’t know where that dot should be, or whether it is even in the interval.
We just have an interval of estimates. It is a 95% confidence interval of estimates, because the process that generates it produces a good interval about 95% of the time. That certainly beats guessing the ratio at random!
Keep in mind that this interval is an approximate 95% confidence interval. There are many approximations involved in its computation. The approximation is not bad, but it is not exact.
13.3.2. Estimating a Population Average#
What we have done for medians can be done for averages as well. Suppose we want to estimate the average age of the mothers in the population. A natural estimate is the average age of the mothers in the sample. Here is the distribution of their ages, and their average age which was about 27.2 years.
births.select('Maternal Age').hist()
np.average(births.column('Maternal Age'))
27.228279386712096
What was the average age of the mothers in the population? We don’t know the value of this parameter.
Let’s estimate the unknown parameter by the bootstrap method. To do this, we will adapt the code for bootstrap_median
to instead define the function bootstrap_mean
. The code is the same except that the statistics are means (that is, averages) instead of medians, and are collected in an array called bstrap_means
instead of bstrap_medians
.
def one_bootstrap_mean():
resample = births.sample()
return np.average(resample.column('Maternal Age'))
# Generate means from 5000 bootstrap samples
num_repetitions = 5000
bstrap_means = make_array()
for i in np.arange(num_repetitions):
bstrap_means = np.append(bstrap_means, one_bootstrap_mean())
# Get the endpoints of the 95% confidence interval
left = percentile(2.5, bstrap_means)
right = percentile(97.5, bstrap_means)
make_array(left, right)
array([26.90630324, 27.55962521])
The 95% confidence interval goes from about 26.9 years to about 27.6 years. That is, we are estimating that the average age of the mothers in the population is somewhere in the interval 26.9 years to 27.6 years.
Notice how close the two ends are to the average of about 27.2 years in the original sample. The sample size is very large – 1,174 mothers – and so the sample averages don’t vary much. We will explore this observation further in the next chapter.
The empirical histogram of the 5,000 bootstrapped mean ages is shown below, along with the 95% confidence interval for the population mean age.
resampled_means = Table().with_columns(
'Bootstrap Sample Mean', bstrap_means
)
resampled_means.hist(bins=15)
plots.plot([left, right], [0, 0], color='yellow', lw=8);
Once again, the average of the original sample (27.23 years) is close to the center of the interval. That’s not very surprising, because each bootstrapped sample is drawn from that same original sample. The averages of the bootstrapped samples are about symmetrically distributed on either side of the average of the sample from which they were drawn.
Notice also that the empirical histogram of the resampled means has roughly a symmetric bell shape, even though the histogram of the sampled ages was not symmetric at all:
births.select('Maternal Age').hist()
This is a consequence of the Central Limit Theorem of probability and statistics. In later sections, we will see what the theorem says.
13.3.3. An 80% Confidence Interval#
You can use the bootstrapped sample means to construct an interval of any level of confidence. For example, to construct an 80% confidence interval for the mean age in the population, you would take the “middle 80%” of the resampled means. So you would want 10% of the distribution in each of the two tails, and hence the endpoints would be the 10th and 90th percentiles of the resampled means.
left_80 = percentile(10, bstrap_means)
right_80 = percentile(90, bstrap_means)
make_array(left_80, right_80)
array([27.01277683, 27.44293015])
resampled_means.hist(bins=15)
plots.plot([left_80, right_80], [0, 0], color='yellow', lw=8);
This 80% confidence interval is much shorter than the 95% confidence interval. It only goes from about 27.0 years to about 27.4 years. While that’s a tight set of estimates, you know that this process only produces a good interval about 80% of the time.
The earlier process produced a wider interval but we had more confidence in the process that generated it.
To get a narrow confidence interval at a high level of confidence, you’ll have to start with a larger sample. We’ll see why in the next chapter.
13.3.4. Estimating a Population Proportion#
In the sample, 39% of the mothers smoked during pregnancy.
births.where('Maternal Smoker', are.equal_to(True)).num_rows / births.num_rows
0.3909710391822828
Remember that a proportion is an average of zeros and ones. So the proportion of mothers who smoked could also be calculated using array operations as follows.
smoking = births.column('Maternal Smoker')
np.count_nonzero(smoking) / len(smoking)
0.3909710391822828
What percent of mothers in the population smoked during pregnancy? This is an unknown parameter which we can estimate by a bootstrap confidence interval. The steps are analogous to those we took to estimate the population median and mean.
In a process that is now familiar, will start by defining a function one_bootstrap_proportion
that bootstraps the sample and returns the proportion of smokers in the bootstrapped sample. Then we will call the function multiple times using a for
loop, and get the 2.5th perentile and 97.5th percentiles of the bootstrapped proportions.
def one_bootstrap_proportion():
resample = births.sample()
smoking = resample.column('Maternal Smoker')
return np.count_nonzero(smoking) / len(smoking)
# Generate proportions from 5000 bootstrap samples
bstrap_proportions = make_array()
num_repetitions = 5000
for i in np.arange(num_repetitions):
bstrap_proportions = np.append(bstrap_proportions, one_bootstrap_proportion())
# Get the endpoints of the 95% confidence interval
left = percentile(2.5, bstrap_proportions)
right = percentile(97.5, bstrap_proportions)
make_array(left, right)
array([0.36286201, 0.41908007])
The confidence interval goes from about 36% to about 42%.
resampled_proportions = Table().with_columns(
'Bootstrap Sample Proportion', bstrap_proportions
)
resampled_proportions.hist(bins=15)
plots.plot([left, right], [0, 0], color='yellow', lw=8);
13.3.5. Care in Using the Bootstrap Percentile Method#
The bootstrap is an elegant and powerful method. Before using it, it is important to keep some points in mind.
Start with a large random sample. If you don’t, the method might not work. Its success is based on large random samples (and hence also resamples from the sample) resembling the population. The Law of Averages says that this is likely to be true provided the random sample is large.
To approximate the probability distribution of a statistic, it is a good idea to replicate the resampling procedure as many times as possible. A few thousand replications will result in decent approximations to the distribution of sample median, especially if the distribution of the population has one peak and is fairly symmetric. We used 5,000 replications in our examples but would recommend 10,000 in general.
The bootstrap percentile method works well for estimating the population median or mean based on a large random sample. However, it has limitations, as do all methods of estimation. For example, it is not expected to do well in the following situations.
The goal is to estimate the minimum or maximum value in the population, or a very low or very high percentile, or parameters that are greatly influenced by rare elements of the population.
The probability distribution of the statistic is not roughly bell shaped.
The original sample is very small, say less than 10 or 15.