Visual Diagnostics - Computational and Inferential Thinking

Suppose a data scientist has decided to use linear regression to estimate values of one variable (called the response variable) based on another variable (called the predictor). To see how well this method of estimation performs, the data scientist must measure how far off the estimates are from the actual values. These differences are called residuals.

\mbox{residual} ~=~ \mbox{observed value} ~-~ \mbox{regression estimate}

(1)

A residual is what’s left over – the residue – after estimation.

Residuals are the vertical distances of the points from the regression line. There is one residual for each point in the scatter plot. The residual is the difference between the observed value of $y$ and the fitted value of $y$ , so for the point $(x, y)$ ,

\mbox{residual} ~~ = ~~ y ~-~ \mbox{fitted value of }y ~~ = ~~ y ~-~ \mbox{height of regression line at }x

(2)

The function residual calculates the residuals. The calculation assumes all the relevant functions we have already defined: standard_units, correlation, slope, intercept, and fit.

from datascience import *
path_data = '../../../assets/data/'
import numpy as np
from scipy import stats

import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

family_heights = Table.read_table(path_data + 'family_heights.csv')
heights = family_heights.select('midparentHeight', 'childHeight')
heights = heights.relabel(0, 'MidParent').relabel(1, 'Child')
hybrid = Table.read_table(path_data + 'hybrid.csv')

def standard_units(x):
    return (x - np.mean(x))/np.std(x)

def correlation(table, x, y):
    x_in_standard_units = standard_units(table.column(x))
    y_in_standard_units = standard_units(table.column(y))
    return np.mean(x_in_standard_units * y_in_standard_units)

def slope(table, x, y):
    r = correlation(table, x, y)
    return r * np.std(table.column(y))/np.std(table.column(x))

def intercept(table, x, y):
    a = slope(table, x, y)
    return np.mean(table.column(y)) -  a * np.mean(table.column(x))

def fit(table, x, y):
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * table.column(x) + b

def residual(table, x, y):
    return table.column(y) - fit(table, x, y)

Continuing our example of estimating the heights of adult children (the response) based on the midparent height (the predictor), let us calculate the fitted values and the residuals.

heights = heights.with_columns(
        'Fitted Value', fit(heights, 'MidParent', 'Child'),
        'Residual', residual(heights, 'MidParent', 'Child')
    )
heights

When there are so many variables to work with, it is always helpful to start with visualization. The function scatter_fit draws the scatter plot of the data, as well as the regression line.

def scatter_fit(table, x, y):
    table.scatter(x, y, s=15)
    plots.plot(table.column(x), fit(table, x, y), lw=4, color='gold')
    plots.xlabel(x)
    plots.ylabel(y)

scatter_fit(heights, 'MidParent', 'Child')

Scatterplot with 'MidParent' on the x-axis and 'Child' on the y-axis. The data points are in dark blue and have a weak to moderate positive association. A gold line is drawn through the center of the data with a positive slope.

A residual plot can be drawn by plotting the residuals against the predictor variable. The function residual_plot does just that.

def residual_plot(table, x, y):
    x_array = table.column(x)
    t = Table().with_columns(
            x, x_array,
            'residuals', residual(table, x, y)
        )
    t.scatter(x, 'residuals', color='r')
    xlims = make_array(min(x_array), max(x_array))
    plots.plot(xlims, make_array(0, 0), color='darkblue', lw=4)
    plots.title('Residual Plot')

residual_plot(heights, 'MidParent', 'Child')

Scatterplot labeled 'Residual Plot' with 'MidParent' on the x-axis and 'residuals' on the y-axis. The x-axis has the same range as in the previous graph, x=64 to x=74, but the y-axis now ranges from y=-10 to y=10. The data points are in red and have no association; they are a blob. A dark blue horizontal line is drawn at y=0.

The midparent heights are on the horizontal axis, as in the original scatter plot. But now the vertical axis shows the residuals. Notice that the plot appears to be centered around the horizontal line at the level 0 (shown in dark blue). Notice also that the plot shows no upward or downward trend. We will observe later that this lack of trend is true of all regressions.

15.5.1Regression Diagnostics¶

Residual plots help us make visual assessments of the quality of a linear regression analysis. Such assessments are called diagnostics. The function regression_diagnostic_plots draws the original scatter plot as well as the residual plot for ease of comparison.

def regression_diagnostic_plots(table, x, y):
    scatter_fit(table, x, y)
    residual_plot(table, x, y)

regression_diagnostic_plots(heights, 'MidParent', 'Child')

A previous scatterplot is reproduced here without changes: Scatterplot with 'MidParent' on the x-axis and 'Child' on the y-axis. The data points are in dark blue and have a weak to moderate positive association. A gold line is drawn through the center of the data with a positive slope.

A previous scatterplot is reproduced here without changes: Scatterplot labeled 'Residual Plot' with 'MidParent' on the x-axis and 'residuals' on the y-axis. The x-axis has the same range as in the previous graph, x=64 to x=74, but the y-axis now ranges from y=-10 to y=10. The data points are in red and have no association; they are a blob. A dark blue horizontal line is drawn at y=0.

This residual plot indicates that linear regression was a reasonable method of estimation. Notice how the residuals are distributed fairly symmetrically above and below the horizontal line at 0, corresponding to the original scatter plot being roughly symmetrical above and below. Notice also that the vertical spread of the plot is fairly even across the most common values of the children’s heights. In other words, apart from a few outlying points, the plot isn’t narrower in some places and wider in others.

In other words, the accuracy of the regression appears to be about the same across the observed range of the predictor variable.

The residual plot of a good regression shows no pattern. The residuals look about the same, above and below the horizontal line at 0, across the range of the predictor variable.

15.5.2Detecting Nonlinearity¶

Drawing the scatter plot of the data usually gives an indication of whether the relation between the two variables is non-linear. Often, however, it is easier to spot non-linearity in a residual plot than in the original scatter plot. This is usually because of the scales of the two plots: the residual plot allows us to zoom in on the errors and hence makes it easier to spot patterns.

A dugong; a large marine mammal pictured underwater.

Our data are a dataset on the age and length of dugongs, which are marine mammals related to manatees and sea cows (image from Wikimedia Commons). The data are in a table called dugong. Age is measured in years and length in meters. Because dugongs tend not to keep track of their birthdays, ages are estimated based on variables such as the condition of their teeth.

dugong = Table.read_table(path_data + 'dugongs.csv')
dugong = dugong.move_to_start('Length')
dugong

If we could measure the length of a dugong, what could we say about its age? Let’s examine what our data say. Here is a regression of age (the response) on length (the predictor). The correlation between the two variables is substantial, at 0.83.

correlation(dugong, 'Length', 'Age')

0.8296474554905714

High correlation notwithstanding, the plot shows a curved pattern that is much more visible in the residual plot.

regression_diagnostic_plots(dugong, 'Length', 'Age')

Scatterplot with 'Length' on the x-axis and 'Age' on the y-axis. There is a fairly strong positive assocation; as x values increase so do y values. Some outliers exist for data points with large x values that have higher y values. A gold line is drawn through the data; the outliers are far from the gold line.

Scatterplot titled 'Residual Plot' with 'Length' on the x-axis and 'residuals' on the y-axis. The y-axis ranges from -5 to 15. Data points are in red, and tend to have y values lower than 5, but two data points with high x values have y values greater than 5. A dark blue horizontal line sits above the center of the data at y=0, excluding the outliers.

While you can spot the non-linearity in the original scatter, it is more clearly evident in the residual plot.

At the low end of the lengths, the residuals are almost all positive; then they are almost all negative; then positive again at the high end of lengths. In other words the regression estimates have a pattern of being too high, then too low, then too high. That means it would have been better to use a curve instead of a straight line to estimate the ages.

When a residual plot shows a pattern, there may be a non-linear relation between the variables.

15.5.3Detecting Heteroscedasticity¶

Heteroscedasticity is a word that will surely be of interest to those who are preparing for Spelling Bees. For data scientists, its interest lies in its meaning, which is “uneven spread”.

Recall the table hybrid that contains data on hybrid cars in the U.S. Here is a regression of fuel efficiency on the rate of acceleration. The association is negative: cars that accelerate quickly tend to be less efficient.

regression_diagnostic_plots(hybrid, 'acceleration', 'mpg')

Scatterplot with 'acceleration' on the x-axis and 'mpg' on the y-axis. Data points are in dark blue and data points with smaller x values tend to have a wider range of possible y values. Data points with larger x values tend to have smaller y values. A gold line is drawn with a negative slope through the data.

Scatterplot titled 'Residual Plot' with 'acceleration' on the x-axis and 'residuals' on the y-axis. A horizontal dark blue line is drawn at y=0. On the left hand side of the plot, the red data points are further away from the dark blue line, both above and below it, than the red data points on the right hand side of the graph.

Notice how the residual plot flares out towards the low end of the accelerations. In other words, the variability in the size of the errors is greater for low values of acceleration than for high values. Uneven variation is often more easily noticed in a residual plot than in the original scatter plot.

If the residual plot shows uneven variation about the horizontal line at 0, the regression estimates are not equally accurate across the range of the predictor variable.

15.5 Visual Diagnostics

15.5.1Regression Diagnostics¶

15.5.2Detecting Nonlinearity¶

15.5.3Detecting Heteroscedasticity¶