10. Sampling and Empirical Distributions

An important part of data science consists of drawing conclusions based on the data in random samples. To interpret their results correctly, data scientists first have to understand exactly what random samples are.

In this chapter we will take a more careful look at sampling, with special attention to the properties of large random samples.

Let’s start by drawing some samples. Our examples are based on the top_movies_2017.csv data set.

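from datascience import *
import numpy as np

# path_data is assumed to have been set earlier to the folder that holds the data files.
top1 = Table.read_table(path_data + 'top_movies_2017.csv')
# Add a column of row indices and move it to the front of the table.
top2 = top1.with_column('Row Index', np.arange(top1.num_rows))
top = top2.move_to_start('Row Index')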

top.set_format(make_array(3, 4), NumberFormatter)

Row Index	Title	Studio	Gross	Gross (Adjusted)	Year
0	Gone with the Wind	MGM	198,676,459	1,796,176,700	1939
1	Star Wars	Fox	460,998,007	1,583,483,200	1977
2	The Sound of Music	Fox	158,671,368	1,266,072,700	1965
3	E.T.: The Extra-Terrestrial	Universal	435,110,554	1,261,085,000	1982
4	Titanic	Paramount	658,672,302	1,204,368,000	1997
5	The Ten Commandments	Paramount	65,500,000	1,164,590,000	1956
6	Jaws	Universal	260,000,000	1,138,620,700	1975
7	Doctor Zhivago	MGM	111,721,910	1,103,564,200	1965
8	The Exorcist	Warner Brothers	232,906,145	983,226,600	1973
9	Snow White and the Seven Dwarves	Disney	184,925,486	969,010,000	1937

... (190 rows omitted)

Sampling Rows of a Table

Each row of a data table represents an individual; in top, each individual is a movie. Sampling individuals can thus be achieved by sampling the rows of a table.

The contents of a row are the values of different variables measured on the same individual. So the contents of the sampled rows form samples of values of each of the variables.

Deterministic Samples

When you simply specify which elements of a set you want to choose, without any chances involved, you create a deterministic sample.

You have done this many times, for example by using take:

top.take(make_array(3, 18, 100))

Row Index	Title	Studio	Gross	Gross (Adjusted)	Year
3	E.T.: The Extra-Terrestrial	Universal	435,110,554	1,261,085,000	1982
18	The Lion King	Buena Vista	422,783,777	792,511,700	1994
100	The Hunger Games	Lionsgate	408,010,692	452,174,400	2012

You have also used where:

top.where('Title', are.containing('Harry Potter'))

Row Index	Title	Studio	Gross	Gross (Adjusted)	Year
74	Harry Potter and the Sorcerer's Stone	Warner Brothers	317,575,550	497,066,400	2001
114	Harry Potter and the Deathly Hallows Part 2	Warner Brothers	381,011,219	426,630,300	2011
131	Harry Potter and the Goblet of Fire	Warner Brothers	290,013,036	401,608,200	2005
133	Harry Potter and the Chamber of Secrets	Warner Brothers	261,988,482	399,302,200	2002
154	Harry Potter and the Order of the Phoenix	Warner Brothers	292,004,738	377,314,200	2007
175	Harry Potter and the Half-Blood Prince	Warner Brothers	301,959,197	359,788,300	2009
177	Harry Potter and the Prisoner of Azkaban	Warner Brothers	249,541,069	357,233,500	2004

While these are samples, they are not random samples. They don’t involve chance.

Probability Samples

For describing random samples, some terminology will be helpful.

A population is the set of all elements from which a sample will be drawn.

A probability sample is one for which it is possible to calculate, before the sample is drawn, the chance with which any subset of elements will enter the sample.

In a probability sample, all elements need not have the same chance of being chosen.

A Random Sampling Scheme

For example, suppose you choose two people from a population that consists of three people A, B, and C, according to the following scheme:

Person A is chosen with probability 1.
One of Persons B or C is chosen according to the toss of a coin: if the coin lands heads, you choose B, and if it lands tails you choose C.

This is a probability sample of size 2. Here are the chances of entry for all non-empty subsets:

A: 1
B: 1/2
C: 1/2
AB: 1/2
AC: 1/2
BC: 0
ABC: 0

Person A has a higher chance of being selected than Persons B or C; indeed, Person A is certain to be selected. Since these differences are known and quantified, they can be taken into account when working with the sample.
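The chances listed above can be checked by enumerating the two equally likely outcomes of the coin toss. The following is a minimal sketch that uses only the Python standard library. The names outcomes and chance_included are illustrative choices, not part of the scheme itself.

```python
from fractions import Fraction
from itertools import combinations

# The scheme: A is always chosen; a fair coin picks B (heads) or C (tails).
# Each possible sample is paired with its probability.
outcomes = [
    (frozenset('AB'), Fraction(1, 2)),  # coin lands heads
    (frozenset('AC'), Fraction(1, 2)),  # coin lands tails
]

def chance_included(subset):
    """Chance that every element of `subset` ends up in the sample."""
    subset = frozenset(subset)
    return sum((p for sample, p in outcomes if subset <= sample), Fraction(0))

# Print the chance of entry for every non-empty subset of {A, B, C}.
for size in (1, 2, 3):
    for members in combinations('ABC', size):
        print(''.join(members), chance_included(members))
```

Because every chance is computed before any sample is drawn, this scheme satisfies the definition of a probability sample even though the three people do not have equal chances of selection.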

A Systematic Sample

Imagine all the elements of the population listed in a sequence. One method of sampling starts by choosing a random position early in the list, and then evenly spaced positions after that. The sample consists of the elements in those positions. Such a sample is called a systematic sample.

Here we will choose a systematic sample of the rows of top. We will start by picking one of the first 10 rows at random, and then we will pick every 10th row after that.

# Choose a random start among rows 0 through 9;
# then take every 10th row.

start = np.random.choice(np.arange(10))
top.take(np.arange(start, top.num_rows, 10))

Row Index	Title	Studio	Gross	Gross (Adjusted)	Year
6	Jaws	Universal	260,000,000	1,138,620,700	1975
16	Jurassic Park	Universal	402,453,882	817,186,200	1993
26	Mary Poppins	Disney	102,272,727	695,036,400	1964
36	Love Story	Paramount	106,397,186	622,283,500	1970
46	The Robe	Fox	36,000,000	581,890,900	1953
56	Rogue One: A Star Wars Story	Buena Vista	532,177,324	537,326,000	2016
66	The Dark Knight Rises	Warner Brothers	448,139,099	511,902,300	2012
76	Close Encounters of the Third Kind	Columbia	132,088,635	494,066,600	1977
86	Transformers: Revenge of the Fallen	Paramount/Dreamworks	402,111,870	479,179,200	2009
96	Toy Story 3	Buena Vista	415,004,880	464,074,600	2010

... (10 rows omitted)

Run the cell a few times to see how the output varies.

This systematic sample is a probability sample. In this scheme, all rows have chance \(1/10\) of being chosen. For example, Row 23 is chosen if and only if Row 3 is chosen, and the chance of that is \(1/10\).

But not all subsets have the same chance of being chosen. Because the selected rows are evenly spaced, most subsets of rows have no chance of being chosen. The only subsets that are possible are those that consist of rows all separated by multiples of 10. Any of those subsets is selected with chance \(1/10\). Other subsets, such as a subset containing both the 15th and 16th rows of the table, or any subset of size more than 20 (each sample from the 200 rows of top contains exactly 20 rows), are selected with chance 0.
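These chances can be verified by listing all ten possible systematic samples, one for each starting row. The sketch below works with row indices only and assumes the table has 200 rows, as the table display indicates. The names possible_samples and chance_selected are illustrative.

```python
from fractions import Fraction

num_rows = 200   # assumed size of top: 10 rows shown plus 190 omitted
step = 10

# One possible sample for each equally likely starting row 0 through 9.
possible_samples = [set(range(start, num_rows, step)) for start in range(step)]

def chance_selected(rows):
    """Chance that all the given row indices appear in the systematic sample."""
    rows = set(rows)
    hits = sum(1 for sample in possible_samples if rows <= sample)
    return Fraction(hits, len(possible_samples))

print(chance_selected([23]))        # a single row
print(chance_selected([3, 23]))     # two rows separated by a multiple of 10
print(chance_selected([15, 16]))    # adjacent rows can never both be chosen
```

Every individual row appears in exactly one of the ten possible samples, which is why each row has the same chance of selection even though most subsets of rows have none.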

Random Samples Drawn With or Without Replacement

In this course, we will mostly deal with the two most straightforward methods of sampling.

The first is random sampling with replacement, which (as we have seen earlier) is the default behavior of np.random.choice when it samples from an array.

The other, called a “simple random sample”, is a sample drawn at random without replacement. Sampled individuals are not replaced in the population before the next individual is drawn. This is the kind of sampling that happens when you deal a hand from a deck of cards, for example. To use np.random.choice for simple random sampling, you must include the argument replace=False.
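As a small illustration of the two methods, the sketch below draws five elements from a population of ten integers, first with replacement and then without. The population, the sample size, and the seed are arbitrary choices made for this example.

```python
import numpy as np

np.random.seed(0)  # fixed seed, chosen arbitrarily, so reruns give the same draws

population = np.arange(10)

# With replacement (the default): the same element can be drawn more than once.
with_replacement = np.random.choice(population, 5)

# Without replacement: a simple random sample, so no element repeats.
simple_random_sample = np.random.choice(population, 5, replace=False)

print(with_replacement)
print(simple_random_sample)
```

With replace=False, the sample size cannot exceed the size of the population, just as a hand of cards cannot contain more cards than the deck.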

In this chapter, we will use simulation to study the behavior of large samples drawn at random with or without replacement.

Convenience Samples

Drawing a random sample requires care and precision. It is not haphazard, even though that is a colloquial meaning of the word “random”. If you stand at a street corner and take as your sample the first ten people who pass by, you might think you’re sampling at random because you didn’t choose who walked by. But it’s not a random sample; it’s a sample of convenience. You didn’t know ahead of time the probability of each person entering the sample; perhaps you hadn’t even specified exactly who was in the population.