An important aspect of data science is to find out what data can tell us about the future. What do data about climate and pollution say about temperatures a few decades from now? Based on a person’s internet profile, which websites are likely to interest them? How can a patient’s medical history be used to judge how well he or she will respond to a treatment?
To answer such questions, data scientists have developed methods for making predictions. In this chapter we will study one of the most commonly used ways of predicting the value of one variable based on the value of another.
Here is a historical dataset used for the prediction of the heights of adults based on the heights of their parents. We have studied this dataset in an earlier section. The table
heights contains data on the midparent height and child’s height (all in inches) for a population of 934 adult “children”. Recall that the midparent height is an average of the heights of the two parents.
# Data on heights of parents and their adult children original = Table.read_table(path_data + 'family_heights.csv') heights = Table().with_columns( 'MidParent', original.column('midparentHeight'), 'Child', original.column('childHeight') )
... (924 rows omitted)
A primary reason for studying the data was to be able to predict the adult height of a child born to parents who were similar to those in the dataset. We made these predictions in Section 8.1, after noticing the positive association between the two variables.
Our approach was to base the prediction on all the points that correspond to a midparent height of around the midparent height of the new person. To do this, we wrote a function called
predict_child which takes a midparent height as its argument and returns the average height of all the children who had midparent heights within half an inch of the argument.
def predict_child(mpht): """Return a prediction of the height of a child whose parents have a midparent height of mpht. The prediction is the average height of the children whose midparent height is in the range mpht plus or minus 0.5 inches. """ close_points = heights.where('MidParent', are.between(mpht-0.5, mpht + 0.5)) return close_points.column('Child').mean()
We applied the function to the column of
Midparent heights, and visualized the result.
# Apply predict_child to all the midparent heights heights_with_predictions = heights.with_column( 'Prediction', heights.apply(predict_child, 'MidParent') )
# Draw the original scatter plot along with the predicted values heights_with_predictions.scatter('MidParent')
The prediction at a given midparent height lies roughly at the center of the vertical strip of points at the given height. This method of prediction is called regression. Later in this chapter we will see whether we can avoid our arbitrary definitions of “closeness” being “within 0.5 inches”. But first we will develop a measure that can be used in many settings to decide how good one variable will be as a predictor of another.