{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "from datascience import *\n", "%matplotlib inline\n", "path_data = '../../../assets/data/'\n", "import matplotlib.pyplot as plots\n", "plots.style.use('fivethirtyeight')\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Variability\n", "The mean tells us where a histogram balances. But in almost every histogram we have seen, the values spread out on both sides of the mean. How far from the mean can they be? To answer this question, we will develop a measure of variability about the mean.\n", "\n", "We will start by describing how to calculate the measure. Then we will see why it is a good measure to calculate." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Rough Size of Deviations from Average\n", "For simplicity, we will begin our calculations in the context of a simple array any_numbers consisting of just four values. As you will see, our method will extend easily to any other array of values." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "any_numbers = make_array(1, 2, 2, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal is to measure roughly how far off the numbers are from their average. To do this, we first need the average: " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3.75" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Step 1. The average.\n", "\n", "mean = np.mean(any_numbers)\n", "mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's find out how far each value is from the mean. These are called the *deviations from the average*. A \"deviation from average\" is just a value minus the average. The table calculation_steps displays the results." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
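<table>
<thead><tr><th>Value</th><th>Deviation from Average</th></tr></thead>
<tbody>
<tr><td>1</td><td>-2.75</td></tr>
<tr><td>2</td><td>-1.75</td></tr>
<tr><td>2</td><td>-1.75</td></tr>
<tr><td>10</td><td>6.25</td></tr>
</tbody>
</table>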
" ], "text/plain": [ "Value | Deviation from Average\n", "1 | -2.75\n", "2 | -1.75\n", "2 | -1.75\n", "10 | 6.25" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Step 2. The deviations from average.\n", "\n", "deviations = any_numbers - mean\n", "calculation_steps = Table().with_columns(\n", " 'Value', any_numbers,\n", " 'Deviation from Average', deviations\n", " )\n", "calculation_steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of the deviations are negative; those correspond to values that are below average. Positive deviations correspond to above-average values.\n", "\n", "To calculate roughly how big the deviations are, it is natural to compute the mean of the deviations. But something interesting happens when all the deviations are added together:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sum(deviations)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The positive deviations exactly cancel out the negative ones. This is true of all lists of numbers, no matter what the histogram of the list looks like: **the sum of the deviations from average is zero.** \n", "\n", "Since the sum of the deviations is 0, the mean of the deviations will be 0 as well:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(deviations)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because of this, the mean of the deviations is not a useful measure of the size of the deviations. What we really want to know is roughly how big the deviations are, regardless of whether they are positive or negative. 
So we need a way to eliminate the signs of the deviations.\n", "\n", "There are two time-honored ways of losing signs: the absolute value, and the square. It turns out that taking the square constructs a measure with extremely powerful properties, some of which we will study in this course.\n", "\n", "So let's eliminate the signs by squaring all the deviations. Then we will take the mean of the squares:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
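<table>
<thead><tr><th>Value</th><th>Deviation from Average</th><th>Squared Deviations from Average</th></tr></thead>
<tbody>
<tr><td>1</td><td>-2.75</td><td>7.5625</td></tr>
<tr><td>2</td><td>-1.75</td><td>3.0625</td></tr>
<tr><td>2</td><td>-1.75</td><td>3.0625</td></tr>
<tr><td>10</td><td>6.25</td><td>39.0625</td></tr>
</tbody>
</table>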
" ], "text/plain": [ "Value | Deviation from Average | Squared Deviations from Average\n", "1 | -2.75 | 7.5625\n", "2 | -1.75 | 3.0625\n", "2 | -1.75 | 3.0625\n", "10 | 6.25 | 39.0625" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Step 3. The squared deviations from average\n", "\n", "squared_deviations = deviations ** 2\n", "calculation_steps = calculation_steps.with_column(\n", " 'Squared Deviations from Average', squared_deviations\n", " )\n", "calculation_steps" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "13.1875" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Step 4. Variance = the mean squared deviation from average\n", "\n", "variance = np.mean(squared_deviations)\n", "variance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variance:** The mean squared deviation calculated above is called the *variance* of the values. \n", "\n", "While the variance does give us an idea of spread, it is not on the same scale as the original variable as its units are the square of the original. This makes interpretation very difficult. \n", "\n", "So we return to the original scale by taking the positive square root of the variance:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3.6314597615834874" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Step 5.\n", "# Standard Deviation: root mean squared deviation from average\n", "# Steps of calculation: 5 4 3 2 1\n", "\n", "sd = variance ** 0.5\n", "sd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Standard Deviation\n", "\n", "The quantity that we have just computed is called the *standard deviation* of the list, and is abbreviated as SD. 
It measures roughly how far the numbers on the list are from their average.\n", "\n", "**Definition.** The SD of a list is defined as the *root mean square of deviations from average*. That's a mouthful. But read it from right to left and you have the sequence of steps in the calculation.\n", "\n", "**Computation.** The five steps described above result in the SD. You can also use the function np.std to compute the SD of values in an array:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3.6314597615834874" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.std(any_numbers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with the SD\n", "\n", "To see what we can learn from the SD, let's move to a more interesting dataset than any_numbers. The table nba13 contains data on the players in the National Basketball Association (NBA) in 2013. For each player, the table records the position at which the player usually played, his height in inches, his weight in pounds, and his age in years." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
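<table>
<thead><tr><th>Name</th><th>Position</th><th>Height</th><th>Weight</th><th>Age in 2013</th></tr></thead>
<tbody>
<tr><td>DeQuan Jones</td><td>Guard</td><td>80</td><td>221</td><td>23</td></tr>
<tr><td>Darius Miller</td><td>Guard</td><td>80</td><td>235</td><td>23</td></tr>
<tr><td>Trevor Ariza</td><td>Guard</td><td>80</td><td>210</td><td>28</td></tr>
<tr><td>James Jones</td><td>Guard</td><td>80</td><td>215</td><td>32</td></tr>
<tr><td>Wesley Johnson</td><td>Guard</td><td>79</td><td>215</td><td>26</td></tr>
<tr><td>Klay Thompson</td><td>Guard</td><td>79</td><td>205</td><td>23</td></tr>
<tr><td>Thabo Sefolosha</td><td>Guard</td><td>79</td><td>215</td><td>29</td></tr>
<tr><td>Chase Budinger</td><td>Guard</td><td>79</td><td>218</td><td>25</td></tr>
<tr><td>Kevin Martin</td><td>Guard</td><td>79</td><td>185</td><td>30</td></tr>
<tr><td>Evan Fournier</td><td>Guard</td><td>79</td><td>206</td><td>20</td></tr>
</tbody>
</table>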
\n", "

... (495 rows omitted)