[Figure: horizontal bar chart of the Eligible and Panels proportions for each ethnicity]
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"jury.barh('Ethnicity')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comparison with Panels Selected at Random\n",
"What if we select a random sample of 1,453 people from the population of eligible jurors? Will the distribution of their ethnicities look like the distribution of the panels above?\n",
"\n",
"We can answer these questions by using `sample_proportions` and augmenting the `jury` table with a column of the proportions in our sample.\n",
"\n",
"**Technical note.** Random samples of prospective jurors would be selected without replacement. However, when the size of a sample is small relative to the size of the population, sampling without replacement resembles sampling with replacement; the proportions in the population don't change much between draws. The population of eligible jurors in Alameda County is over a million, and compared to that, a sample size of about 1500 is quite small. We will therefore sample with replacement.\n",
"\n",
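"As a rough sketch of what this sampling step involves, the code below draws category proportions with NumPy alone. It assumes that sampling with replacement from a categorical distribution can be modeled as a multinomial draw; the helper name `sample_proportions_sketch` and the fixed seed are illustrative assumptions, not part of the `datascience` library.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def sample_proportions_sketch(sample_size, probabilities, rng):\n",
"    # Draw category counts for sample_size independent draws (with replacement),\n",
"    # then convert the counts to proportions.\n",
"    counts = rng.multinomial(sample_size, probabilities)\n",
"    return counts / sample_size\n",
"\n",
"# Eligible-juror proportions, copied from the jury table\n",
"eligible = np.array([0.15, 0.18, 0.54, 0.12, 0.01])\n",
"rng = np.random.default_rng(0)  # seed chosen only for reproducibility\n",
"sketch_sample = sample_proportions_sketch(1453, eligible, rng)\n",
"print(sketch_sample)\n",
"```\n",
"\n",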
"In the cell below, we sample at random 1453 times from the distribution of eligible jurors, and display the distribution of the random sample along with the distributions of the eligible jurors and the panels in the data."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ethnicity | Eligible | Panels | Random Sample\n",
"Asian/PI | 0.15 | 0.26 | 0.14384\n",
"Black/AA | 0.18 | 0.08 | 0.163799\n",
"Caucasian | 0.54 | 0.54 | 0.538197\n",
"Hispanic | 0.12 | 0.08 | 0.143152\n",
"Other | 0.01 | 0.04 | 0.0110117"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eligible_population = jury.column('Eligible')\n",
"sample_distribution = sample_proportions(1453, eligible_population)\n",
"panels_and_sample = jury.with_column('Random Sample', sample_distribution)\n",
"panels_and_sample"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The distribution of the random sample is quite close to the distribution of the eligible population, unlike the distribution of the panels. As always, it helps to visualize."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"[Figure: horizontal bar chart of the Eligible, Panels, and Random Sample proportions for each ethnicity]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"panels_and_sample.barh('Ethnicity')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The bar chart shows that the distribution of the random sample resembles the eligible population but the distribution of the panels does not."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To assess whether this observation is particular to one random sample or more general, we can simulate multiple panels under the model of random selection and see what the simulations predict. But we won't be able to look at thousands of bar charts like the one above. We need a statistic that will help us assess whether or not the model of random selection is supported by the data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A New Statistic: The Distance between Two Distributions\n",
"We know how to measure how different two numbers are: if the numbers are $x$ and $y$, the distance between them is $\\vert x-y \\vert$. Now we have to quantify the distance between two distributions. For example, we have to measure the distance between the blue and gold distributions below."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
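"[Figure: horizontal bar chart of the Eligible (blue) and Panels (gold) proportions for each ethnicity]"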
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"jury.barh('Ethnicity')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this we will compute a quantity called the *total variation distance* between two distributions. The calculation is an extension of how we find the distance between two numbers.\n",
"\n",
"To compute the total variation distance, we first find the difference between the two proportions in each category."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ethnicity | Eligible | Panels | Difference\n",
"Asian/PI | 0.15 | 0.26 | 0.11\n",
"Black/AA | 0.18 | 0.08 | -0.1\n",
"Caucasian | 0.54 | 0.54 | 0\n",
"Hispanic | 0.12 | 0.08 | -0.04\n",
"Other | 0.01 | 0.04 | 0.03"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Augment the table with a column of differences between proportions\n",
"\n",
"jury_with_diffs = jury.with_column(\n",
"    'Difference', jury.column('Panels') - jury.column('Eligible')\n",
")\n",
"jury_with_diffs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Take a look at the column `Difference` and notice that the sum of its entries is 0: the positive entries add up to 0.14, exactly canceling the total of the negative entries which is -0.14. \n",
"\n",
"This is numerical evidence of the fact that in the bar chart, the gold bars exceed the blue bars by exactly as much as the blue bars exceed the gold. The proportions in each of the two columns ``Panels`` and ``Eligible`` add up to 1, and so the give-and-take between their entries must add up to 0. \n",
"\n",
"To avoid the cancellation, we drop the negative signs and then add all the entries. But this gives us two times the total of the positive entries (equivalently, two times the total of the negative entries, with the sign removed). We don't need that doubling, so we divide the sum by 2."
]
},
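{
"cell_type": "markdown",
"metadata": {},
"source": [
"In symbols, the reasoning above can be sketched as follows. The notation is introduced only for this illustration: $p_i$ is the proportion of category $i$ in the panels and $e_i$ is the proportion of the same category in the eligible population.\n",
"\n",
"$$\\sum_i (p_i - e_i) = \\sum_i p_i - \\sum_i e_i = 1 - 1 = 0$$\n",
"\n",
"Dropping the signs and dividing by 2 gives the total variation distance:\n",
"\n",
"$$\\frac{1}{2} \\sum_i \\vert p_i - e_i \\vert$$"
]
},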
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ethnicity | Eligible | Panels | Difference | Absolute Difference\n",
"Asian/PI | 0.15 | 0.26 | 0.11 | 0.11\n",
"Black/AA | 0.18 | 0.08 | -0.1 | 0.1\n",
"Caucasian | 0.54 | 0.54 | 0 | 0\n",
"Hispanic | 0.12 | 0.08 | -0.04 | 0.04\n",
"Other | 0.01 | 0.04 | 0.03 | 0.03"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"jury_with_diffs = jury_with_diffs.with_column(\n",
"    'Absolute Difference', np.abs(jury_with_diffs.column('Difference'))\n",
")\n",
"\n",
"jury_with_diffs"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.14"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"jury_with_diffs.column('Absolute Difference').sum() / 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This quantity 0.14 is the *total variation distance* (TVD) between the distribution of ethnicities in the eligible juror population and the distribution in the panels.\n",
"\n",
"In general, the total variation distance between two distributions measures how close the distributions are. The larger the TVD, the more different the two distributions appear.\n",
"\n",
"**Technical Note:** We could have obtained the same result by just adding the positive differences. But our method of including all the absolute differences eliminates the need to keep track of which differences are positive and which are not."
]
},
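{
"cell_type": "markdown",
"metadata": {},
"source": [
"The technical note can be checked numerically. The sketch below is self-contained and uses plain NumPy arrays holding the proportions from the table; the variable names are illustrative.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"eligible = np.array([0.15, 0.18, 0.54, 0.12, 0.01])\n",
"panels = np.array([0.26, 0.08, 0.54, 0.08, 0.04])\n",
"diffs = panels - eligible\n",
"\n",
"# Method used in the text: half the sum of all absolute differences\n",
"half_abs_sum = np.abs(diffs).sum() / 2\n",
"# Alternative from the technical note: add only the positive differences\n",
"positive_sum = diffs[diffs > 0].sum()\n",
"\n",
"# The two agree up to floating-point rounding\n",
"print(np.isclose(half_abs_sum, positive_sum))\n",
"```"
]
},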
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use the total variation distance between distributions as the statistic to simulate under the assumption of random selection. Large values of the distance will be evidence against random selection."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Simulating the Statistic Under the Model\n",
"To see how the TVD varies across random samples, we will simulate it repeatedly under the model of random selection from the eligible population.\n",
"\n",
"Let's organize our calculation. Since we are going to be computing total variation distance repeatedly, we will first write a function that computes it for two given distributions.\n",
"\n",
"The function `total_variation_distance` takes two arrays containing the distributions to compare, and returns the TVD between them."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"def total_variation_distance(distribution_1, distribution_2):\n",
"    # Half the sum of the absolute differences between corresponding proportions\n",
"    return sum(np.abs(distribution_1 - distribution_2)) / 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This function will help us calculate our statistic in each repetition of the simulation. But first let's check that it gives the right answer when we use it to compute the distance between the blue (eligible) and gold (panels) distributions above. These are the distributions in the ACLU study."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.14"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"total_variation_distance(jury.column('Panels'), jury.column('Eligible'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This agrees with the value that we computed directly without using the function.\n",
"\n",
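"A few edge cases can build further confidence in the function. The sketch below redefines the function so that it runs on its own; the small example distributions are made up for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def total_variation_distance(distribution_1, distribution_2):\n",
"    return sum(np.abs(distribution_1 - distribution_2)) / 2\n",
"\n",
"a = np.array([0.5, 0.5, 0.0])\n",
"b = np.array([0.0, 0.0, 1.0])\n",
"\n",
"# Identical distributions are at distance 0\n",
"same = total_variation_distance(a, a)\n",
"# Distributions with no category in common are at distance 1, the largest possible value\n",
"disjoint = total_variation_distance(a, b)\n",
"# The order of the arguments does not matter\n",
"symmetric = total_variation_distance(a, b) == total_variation_distance(b, a)\n",
"print(same, disjoint, symmetric)\n",
"```\n",
"\n",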
"In the cell below we use the function to compute the TVD between the distributions of the eligible jurors and one random sample. Recall that `eligible_population` is the array containing the distribution of the eligible jurors, and that our sample size is 1453.\n",
"\n",
"In the first line, we use `sample_proportions` to generate a random sample from the eligible population. In the next line we use `total_variation_distance` to compute the TVD between the distributions in the random sample and the eligible population."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.018265657260839632"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample_distribution = sample_proportions(1453, eligible_population)\n",
"total_variation_distance(sample_distribution, eligible_population)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the cell a few times and notice that the distances are quite a bit smaller than 0.14, the distance between the distribution of the panels and the eligible jurors.\n",
"\n",
"We are now ready to run a simulation to assess the model of random selection."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simulating One Value of the Statistic\n",
"In the same way that we start every simulation, let's define a function `one_simulated_tvd` that returns one simulated value of the total variation distance under the hypothesis of random selection. \n",
"\n",
"The code in the body of the definition is based on the cell above."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# Simulate one value of\n",
"# the total variation distance between\n",
"# the distribution of a sample selected at random\n",
"# and the distribution of the eligible population\n",
"\n",
"def one_simulated_tvd():\n",
"    sample_distribution = sample_proportions(1453, eligible_population)\n",
"    return total_variation_distance(sample_distribution, eligible_population)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simulating Multiple Values of the Statistic\n",
"Now we can apply the familiar process of using a `for` loop to create an array consisting of 5000 such distances."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"tvds = make_array()\n",
"repetitions = 5000\n",
"for i in np.arange(repetitions):\n",
"    tvds = np.append(tvds, one_simulated_tvd())"
]
},
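{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above can also be written without `np.append`. The sketch below is one possible vectorized alternative, not the approach used in this section: it assumes that each random panel can be modeled as a multinomial draw and uses NumPy's `Generator.multinomial` with a `size` argument to draw all 5000 samples at once. The seed and variable names are illustrative.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"eligible = np.array([0.15, 0.18, 0.54, 0.12, 0.01])\n",
"sample_size = 1453\n",
"repetitions = 5000\n",
"rng = np.random.default_rng(42)  # fixed seed, only for reproducibility\n",
"\n",
"# Each row holds the category counts of one simulated panel\n",
"counts = rng.multinomial(sample_size, eligible, size=repetitions)\n",
"sample_distributions = counts / sample_size\n",
"\n",
"# TVD of every simulated panel from the eligible distribution, one value per row\n",
"sketch_tvds = np.abs(sample_distributions - eligible).sum(axis=1) / 2\n",
"\n",
"# Number of simulated distances at least as large as the observed 0.14\n",
"print(np.count_nonzero(sketch_tvds >= 0.14))\n",
"```"
]
},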
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Assessing the Model of Random Selection\n",
"\n",
"Here is the empirical histogram of the simulated distances. It shows that if you draw 1453 panelists at random from the pool of eligible candidates, then the distance between the distributions of the panelists and the eligible population is rarely more than about 0.05.\n",
"\n",
"The panels in the study, however, were not quite so similar to the eligible population. The total variation distance between the panels and the population was 0.14, shown as the red dot on the horizontal axis. It is far beyond the tail of the histogram and does not look at all like a typical distance between the distributions of a random sample and the eligible population."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"[Figure: empirical histogram of the simulated total variation distances, with the observed distance 0.14 marked as a red dot on the horizontal axis]