6.3. Example: Population Trends#
We are now ready to work with large tables of data. The file below contains “Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States.” Notice that read_table
can read data directly from a URL.
# As of August 2021, this census file is online here:
data = 'http://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2019/nc-est2019-agesex-res.csv'
# A local copy can be accessed here in case census.gov moves the file:
# data = path_data + 'nc-est2019-agesex-res.csv'
full_census_table = Table.read_table(data)
full_census_table
SEX | AGE | CENSUS2010POP | ESTIMATESBASE2010 | POPESTIMATE2010 | POPESTIMATE2011 | POPESTIMATE2012 | POPESTIMATE2013 | POPESTIMATE2014 | POPESTIMATE2015 | POPESTIMATE2016 | POPESTIMATE2017 | POPESTIMATE2018 | POPESTIMATE2019 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3944153 | 3944160 | 3951430 | 3963092 | 3926570 | 3931258 | 3954787 | 3983981 | 3954773 | 3893990 | 3815343 | 3783052 |
0 | 1 | 3978070 | 3978090 | 3957730 | 3966225 | 3977549 | 3942698 | 3948891 | 3973133 | 4002903 | 3972711 | 3908830 | 3829599 |
0 | 2 | 4096929 | 4096939 | 4090621 | 3970654 | 3978925 | 3991740 | 3958711 | 3966321 | 3991349 | 4020045 | 3987032 | 3922044 |
0 | 3 | 4119040 | 4119051 | 4111688 | 4101644 | 3981531 | 3991017 | 4005928 | 3974351 | 3982984 | 4006946 | 4033038 | 3998665 |
0 | 4 | 4063170 | 4063186 | 4077346 | 4121488 | 4111490 | 3992502 | 4004032 | 4020292 | 3989750 | 3997280 | 4018719 | 4043323 |
0 | 5 | 4056858 | 4056872 | 4064521 | 4087054 | 4131049 | 4121876 | 4004576 | 4017589 | 4035033 | 4003452 | 4008443 | 4028281 |
0 | 6 | 4066381 | 4066412 | 4072904 | 4074531 | 4096631 | 4141126 | 4133372 | 4017388 | 4031568 | 4048018 | 4014057 | 4017227 |
0 | 7 | 4030579 | 4030594 | 4042990 | 4082821 | 4084175 | 4106756 | 4152666 | 4145872 | 4030888 | 4044139 | 4058370 | 4022319 |
0 | 8 | 4046486 | 4046497 | 4025501 | 4052773 | 4092559 | 4094513 | 4118349 | 4165033 | 4158848 | 4042924 | 4054236 | 4066194 |
0 | 9 | 4148353 | 4148369 | 4125312 | 4035319 | 4062726 | 4103052 | 4106068 | 4130887 | 4177895 | 4170813 | 4053179 | 4061874 |
... (296 rows omitted)
Only the first 10 rows of the table are displayed. Later we will see how to display the entire table; however, this is typically not useful with large tables.
A description of the table appears online.
The SEX
column contains numeric codes: 0
stands for the total, 1
for male, and 2
for female. The assumptions underlying this binary categorization are questionable. We will discuss them in the next section.
In this section we will focus on age and population size.
The AGE
column contains ages in completed years, but the special value 999
is a sum of the total population. The “age” 100
too has a special status. In that category, the Census Bureau includes all people aged 100 or more.
The other columns contain estimates of the US population in each category of sex and age in the years 2010 through 2019. The Census is decennial: it takes place every 10 years. The most recent Census was held in 2020 and the one before that in 2010. The Census Bureau also estimates population changes each year. As explained in the Bureau’s description of its methodology, it “adds [the estimated changes] to the last decennial census to produce updated population estimates every year.”
Typically, a publicly available table will contain more information than necessary for a particular investigation or analysis. To get the large table into a more usable form, we have to do some data cleaning.
Suppose we are only interested in the population changes from 2014 to 2019. Let’s select
the relevant columns.
partial_census_table = full_census_table.select('SEX', 'AGE', 'POPESTIMATE2014', 'POPESTIMATE2019')
partial_census_table
SEX | AGE | POPESTIMATE2014 | POPESTIMATE2019 |
---|---|---|---|
0 | 0 | 3954787 | 3783052 |
0 | 1 | 3948891 | 3829599 |
0 | 2 | 3958711 | 3922044 |
0 | 3 | 4005928 | 3998665 |
0 | 4 | 4004032 | 4043323 |
0 | 5 | 4004576 | 4028281 |
0 | 6 | 4133372 | 4017227 |
0 | 7 | 4152666 | 4022319 |
0 | 8 | 4118349 | 4066194 |
0 | 9 | 4106068 | 4061874 |
... (296 rows omitted)
We can simplify the labels of the selected columns.
us_pop = partial_census_table.relabeled('POPESTIMATE2014', '2014').relabeled('POPESTIMATE2019', '2019')
us_pop
SEX | AGE | 2014 | 2019 |
---|---|---|---|
0 | 0 | 3954787 | 3783052 |
0 | 1 | 3948891 | 3829599 |
0 | 2 | 3958711 | 3922044 |
0 | 3 | 4005928 | 3998665 |
0 | 4 | 4004032 | 4043323 |
0 | 5 | 4004576 | 4028281 |
0 | 6 | 4133372 | 4017227 |
0 | 7 | 4152666 | 4022319 |
0 | 8 | 4118349 | 4066194 |
0 | 9 | 4106068 | 4061874 |
... (296 rows omitted)
6.3.1. Ages 97-100#
As a warm-up, let’s examine the total population, labeled by SEX
code 0. Since all these rows will have the same value 0 in the SEX
column, we will drop that column.
us_pop_by_age = us_pop.where('SEX', are.equal_to(0)).drop('SEX')
us_pop_by_age
AGE | 2014 | 2019 |
---|---|---|
0 | 3954787 | 3783052 |
1 | 3948891 | 3829599 |
2 | 3958711 | 3922044 |
3 | 4005928 | 3998665 |
4 | 4004032 | 4043323 |
5 | 4004576 | 4028281 |
6 | 4133372 | 4017227 |
7 | 4152666 | 4022319 |
8 | 4118349 | 4066194 |
9 | 4106068 | 4061874 |
... (92 rows omitted)
Now let’s look at the population in the highest ages.
us_pop_by_age.where('AGE', are.between(97, 101))
AGE | 2014 | 2019 |
---|---|---|
97 | 82948 | 116969 |
98 | 59546 | 86150 |
99 | 41277 | 57124 |
100 | 70685 | 100322 |
Not surprisingly, the numbers of people are smaller at higher ages. For example, there are fewer 99-year-olds than 98-year-olds.
But the numbers for AGE
100 are quite a bit larger than those for age 99. That is because the row with AGE
100 doesn’t just represent 100-year-olds. It also includes those who are older than 100.
6.3.2. Percent Change#
Each column of the table us_pop_by_age
is an array of the same length, and so columns can be combined using arithmetic. The array below contains the change in population between 2014 and 2019. There is one entry corresponding to each row of us_pop_by_age
.
change = us_pop_by_age.column('2019') - us_pop_by_age.column('2014')
change
array([-171735, -119292, -36667, ..., 15847, 29637, 9938515])
We can augment us_pop_by_age
with a column that contains these changes, both in absolute terms and as percents relative to the value in 2014.
us_pop_change = us_pop_by_age.with_columns(
'Change', change,
'Percent Change', change/us_pop_by_age.column('2014')
)
us_pop_change.set_format('Percent Change', PercentFormatter)
AGE | 2014 | 2019 | Change | Percent Change |
---|---|---|---|---|
0 | 3954787 | 3783052 | -171735 | -4.34% |
1 | 3948891 | 3829599 | -119292 | -3.02% |
2 | 3958711 | 3922044 | -36667 | -0.93% |
3 | 4005928 | 3998665 | -7263 | -0.18% |
4 | 4004032 | 4043323 | 39291 | 0.98% |
5 | 4004576 | 4028281 | 23705 | 0.59% |
6 | 4133372 | 4017227 | -116145 | -2.81% |
7 | 4152666 | 4022319 | -130347 | -3.14% |
8 | 4118349 | 4066194 | -52155 | -1.27% |
9 | 4106068 | 4061874 | -44194 | -1.08% |
... (92 rows omitted)
Almost all the entries displayed in the Percent Change
column are negative, demonstrating a drop in population at the youngest ages. However, the overall population grew by about 9.9 million people, a percent change of just over 3%.
us_pop_change.where('AGE', are.equal_to(999))
AGE | 2014 | 2019 | Change | Percent Change |
---|---|---|---|---|
999 | 318301008 | 328239523 | 9938515 | 3.12% |
Let us compare this to the change at each age. For ease of interpretation, we will sort the table in decreasing order of the absolute change in population, contained in the column Change
.
us_pop_change.where(
'AGE', are.below(999)
).sort('Change', descending=True)
AGE | 2014 | 2019 | Change | Percent Change |
---|---|---|---|---|
72 | 2191642 | 3191048 | 999406 | 45.60% |
68 | 2567511 | 3345475 | 777964 | 30.30% |
69 | 2530460 | 3252423 | 721963 | 28.53% |
70 | 2461426 | 3136704 | 675278 | 27.43% |
71 | 2516392 | 3083083 | 566691 | 22.52% |
76 | 1692960 | 2222392 | 529432 | 31.27% |
62 | 3677408 | 4156645 | 479237 | 13.03% |
28 | 4345247 | 4818725 | 473478 | 10.90% |
64 | 3481789 | 3950578 | 468789 | 13.46% |
38 | 3848856 | 4305576 | 456720 | 11.87% |
... (91 rows omitted)
Take a look at the top few rows. While the percent change is about 3% for the overall population, it jumps to well over 20% for the people in their late sixties and early seventies. This stunning change contributes to what is known as the greying of America.
What could explain this large increase? We can explore this question by examining the years in which the relevant groups were born.
Those who were in the age group 69 to 72 in 2014 were born in the years 1942 to 1945. The attack on Pearl Harbor was in late 1941, and by 1942 U.S. forces were heavily engaged in a massive war that ended in 1945.
Those who were 69 to 72 years old in 2019 were born in the years 1947 to 1950, at the height of the post-WWII baby boom in the United States.
The post-war jump in births is a major reason for the large changes that we have observed.