6.2. Selecting Rows#

Often, we would like to extract just those rows that correspond to entries with a particular feature. For example, we might want only the rows corresponding to the Warriors, or to players who earned more than \(\$10\) million. Or we might just want the top five earners.

6.2.1. Specified Rows#

The Table method take does just that – it takes a specified set of rows. Its argument is a row index or array of indices, and it creates a new table consisting of only those rows.

For example, if we wanted just the first row of nba, we could use take as follows.

nba
PLAYER POSITION TEAM SALARY
Paul Millsap PF Atlanta Hawks 18.6717
Al Horford C Atlanta Hawks 12
Tiago Splitter C Atlanta Hawks 9.75625
Jeff Teague PG Atlanta Hawks 8
Kyle Korver SG Atlanta Hawks 5.74648
Thabo Sefolosha SF Atlanta Hawks 4
Mike Scott PF Atlanta Hawks 3.33333
Kent Bazemore SF Atlanta Hawks 2
Dennis Schroder PG Atlanta Hawks 1.7634
Tim Hardaway Jr. SG Atlanta Hawks 1.30452

... (407 rows omitted)

nba.take(0)
PLAYER POSITION TEAM SALARY
Paul Millsap PF Atlanta Hawks 18.6717

This is a new table with just the single row that we specified.

We could also get the fourth, fifth, and sixth rows by specifying a range of indices as the argument.

nba.take(np.arange(3, 6))
PLAYER POSITION TEAM SALARY
Jeff Teague PG Atlanta Hawks 8
Kyle Korver SG Atlanta Hawks 5.74648
Thabo Sefolosha SF Atlanta Hawks 4

If we want a table of the top 5 highest paid players, we can first sort the list by salary and then take the first five rows:

nba.sort('SALARY', descending=True).take(np.arange(5))
PLAYER POSITION TEAM SALARY
Kobe Bryant SF Los Angeles Lakers 25
Joe Johnson SF Brooklyn Nets 24.8949
LeBron James SF Cleveland Cavaliers 22.9705
Carmelo Anthony SF New York Knicks 22.875
Dwight Howard C Houston Rockets 22.3594

6.2.2. Rows Corresponding to a Specified Feature#

More often, we will want to access data in a set of rows that have a certain feature, but whose indices we don’t know ahead of time. For example, we might want data on all the players who made more than \(\$10\) million, but we don’t want to spend time counting rows in the sorted table.

The method where does the job for us. Its output is a table with the same columns as the original but only the rows where the feature occurs.

The first argument of where is the label of the column that contains the information about whether or not a row has the feature we want. If the feature is “made more than \(\$10\) million”, the column is SALARY.

The second argument of where is a way of specifying the feature. A couple of examples will make the general method of specification easier to understand.

In the first example, we extract the data for all those who earned more than \(\$10\) million.

nba.where('SALARY', are.above(10))
PLAYER POSITION TEAM SALARY
Paul Millsap PF Atlanta Hawks 18.6717
Al Horford C Atlanta Hawks 12
Joe Johnson SF Brooklyn Nets 24.8949
Thaddeus Young PF Brooklyn Nets 11.236
Al Jefferson C Charlotte Hornets 13.5
Nicolas Batum SG Charlotte Hornets 13.1253
Kemba Walker PG Charlotte Hornets 12
Derrick Rose PG Chicago Bulls 20.0931
Jimmy Butler SG Chicago Bulls 16.4075
Joakim Noah C Chicago Bulls 13.4

... (59 rows omitted)

The use of the argument are.above(10) ensured that each selected row had a value of SALARY that was greater than 10.

There are 69 rows in the new table, corresponding to the 69 players who made more than \(10\) million dollars. Arranging these rows in order makes the data easier to analyze. DeMar DeRozan of the Toronto Raptors was the “poorest” of this group, at a salary of just over \(10\) million dollars.

nba.where('SALARY', are.above(10)).sort('SALARY')
PLAYER POSITION TEAM SALARY
DeMar DeRozan SG Toronto Raptors 10.05
Gerald Wallace SF Philadelphia 76ers 10.1059
Luol Deng SF Miami Heat 10.1516
Monta Ellis SG Indiana Pacers 10.3
Wilson Chandler SF Denver Nuggets 10.4494
Brendan Haywood C Cleveland Cavaliers 10.5225
Jrue Holiday PG New Orleans Pelicans 10.5955
Tyreke Evans SG New Orleans Pelicans 10.7346
Marcin Gortat C Washington Wizards 11.2174
Thaddeus Young PF Brooklyn Nets 11.236

... (59 rows omitted)

How much did Stephen Curry make? For the answer, we have to access the row where the value of PLAYER is equal to Stephen Curry. That is placed a table consisting of just one line:

nba.where('PLAYER', are.equal_to('Stephen Curry'))
PLAYER POSITION TEAM SALARY
Stephen Curry PG Golden State Warriors 11.3708

Curry made just under \(\$11.4\) million dollars. That’s a lot of money, but it’s less than half the salary of LeBron James. You’ll find that salary in the “Top 5” table earlier in this section, or you could find it replacing 'Stephen Curry' by 'LeBron James' in the line of code above.

In the code, are is used again, but this time with the predicate equal_to instead of above. Thus for example you can get a table of all the Warriors:

nba.where('TEAM', are.equal_to('Golden State Warriors')).show()
PLAYER POSITION TEAM SALARY
Klay Thompson SG Golden State Warriors 15.501
Draymond Green PF Golden State Warriors 14.2609
Andrew Bogut C Golden State Warriors 13.8
Andre Iguodala SF Golden State Warriors 11.7105
Stephen Curry PG Golden State Warriors 11.3708
Jason Thompson PF Golden State Warriors 7.00847
Shaun Livingston PG Golden State Warriors 5.54373
Harrison Barnes SF Golden State Warriors 3.8734
Marreese Speights C Golden State Warriors 3.815
Leandro Barbosa SG Golden State Warriors 2.5
Festus Ezeli C Golden State Warriors 2.00875
Brandon Rush SF Golden State Warriors 1.27096
Kevon Looney SF Golden State Warriors 1.13196
Anderson Varejao PF Golden State Warriors 0.289755

This portion of the table is already sorted by salary, because the original table listed players sorted by salary within the same team. The .show() at the end of the line ensures that all rows are shown, not just the first 10.

It is so common to ask for the rows for which some column is equal to some value that the are.equal_to call is optional. Instead, the where method can be called with only a column name and a value to achieve the same effect.

nba.where('TEAM', 'Denver Nuggets') # equivalent to nba.where('TEAM', are.equal_to('Denver Nuggets'))
PLAYER POSITION TEAM SALARY
Danilo Gallinari SF Denver Nuggets 14
Kenneth Faried PF Denver Nuggets 11.236
Wilson Chandler SF Denver Nuggets 10.4494
JJ Hickson C Denver Nuggets 5.6135
Jameer Nelson PG Denver Nuggets 4.345
Will Barton SF Denver Nuggets 3.53333
Emmanuel Mudiay PG Denver Nuggets 3.10224
Darrell Arthur PF Denver Nuggets 2.814
Jusuf Nurkic C Denver Nuggets 1.842
Joffrey Lauvergne C Denver Nuggets 1.70972

... (4 rows omitted)

6.2.3. Multiple Features#

You can access rows that have multiple specified features, by using where repeatedly. For example, here is a way to extract all the Point Guards whose salaries were over \(\$15\) million.

nba.where('POSITION', 'PG').where('SALARY', are.above(15))
PLAYER POSITION TEAM SALARY
Derrick Rose PG Chicago Bulls 20.0931
Kyrie Irving PG Cleveland Cavaliers 16.4075
Chris Paul PG Los Angeles Clippers 21.4687
Russell Westbrook PG Oklahoma City Thunder 16.7442
John Wall PG Washington Wizards 15.852

6.2.4. General Form#

By now you will have realized that the general way to create a new table by selecting rows with a given feature is to use where and are with the appropriate condition:

original_table_name.where(column_label_string, are.condition)

nba.where('SALARY', are.between(10, 10.3))
PLAYER POSITION TEAM SALARY
Luol Deng SF Miami Heat 10.1516
Gerald Wallace SF Philadelphia 76ers 10.1059
Danny Green SG San Antonio Spurs 10
DeMar DeRozan SG Toronto Raptors 10.05

Notice that the table above includes Danny Green who made \(\$10\) million, but not Monta Ellis who made \(\$10.3\) million. As elsewhere in Python, the range between includes the left end but not the right.

If we specify a condition that isn’t satisfied by any row, we get a table with column labels but no rows.

nba.where('PLAYER', are.equal_to('Barack Obama'))
PLAYER POSITION TEAM SALARY

6.2.5. Some More Conditions#

Here are some predicates of are that you might find useful. Note that x and y are numbers, STRING is a string, and Z is either a number or a string; you have to specify these depending on the feature you want.

Predicate

Description

are.equal_to(Z)

Equal to Z

are.above(x)

Greater than x

are.above_or_equal_to(x)

Greater than or equal to x

are.below(x)

Less than x

are.below_or_equal_to(x)

Less than or equal to x

are.between(x, y)

Greater than or equal to x, and less than y

are.strictly_between(x, y)

Greater than x and less than y

are.between_or_equal_to(x, y)

Greater than or equal to x, and less than or equal to y

are.containing(S)

Contains the string S

You can also specify the negation of any of these conditions, by using .not_ before the condition:

Predicate

Description

are.not_equal_to(Z)

Not equal to Z

are.not_above(x)

Not above x

… and so on. The usual rules of logic apply – for example, “not above x” is the same as “below or equal to x”.

We end the section with a series of examples.

The use of are.containing can help save some typing. For example, you can just specify Warriors instead of Golden State Warriors:

nba.where('TEAM', are.containing('Warriors')).show()
PLAYER POSITION TEAM SALARY
Klay Thompson SG Golden State Warriors 15.501
Draymond Green PF Golden State Warriors 14.2609
Andrew Bogut C Golden State Warriors 13.8
Andre Iguodala SF Golden State Warriors 11.7105
Stephen Curry PG Golden State Warriors 11.3708
Jason Thompson PF Golden State Warriors 7.00847
Shaun Livingston PG Golden State Warriors 5.54373
Harrison Barnes SF Golden State Warriors 3.8734
Marreese Speights C Golden State Warriors 3.815
Leandro Barbosa SG Golden State Warriors 2.5
Festus Ezeli C Golden State Warriors 2.00875
Brandon Rush SF Golden State Warriors 1.27096
Kevon Looney SF Golden State Warriors 1.13196
Anderson Varejao PF Golden State Warriors 0.289755

You can extract data for all the guards, both Point Guards and Shooting Guards:

nba.where('POSITION', are.containing('G'))
PLAYER POSITION TEAM SALARY
Jeff Teague PG Atlanta Hawks 8
Kyle Korver SG Atlanta Hawks 5.74648
Dennis Schroder PG Atlanta Hawks 1.7634
Tim Hardaway Jr. SG Atlanta Hawks 1.30452
Jason Richardson SG Atlanta Hawks 0.947276
Lamar Patterson SG Atlanta Hawks 0.525093
Terran Petteway SG Atlanta Hawks 0.525093
Avery Bradley PG Boston Celtics 7.73034
Isaiah Thomas PG Boston Celtics 6.91287
Marcus Smart PG Boston Celtics 3.43104

... (171 rows omitted)

You can get all the players who were not Cleveland Cavaliers and had a salary of no less than \(\$20\) million:

other_than_Cavs = nba.where('TEAM', are.not_equal_to('Cleveland Cavaliers'))
other_than_Cavs.where('SALARY', are.not_below(20))
PLAYER POSITION TEAM SALARY
Joe Johnson SF Brooklyn Nets 24.8949
Derrick Rose PG Chicago Bulls 20.0931
Dwight Howard C Houston Rockets 22.3594
Chris Paul PG Los Angeles Clippers 21.4687
Kobe Bryant SF Los Angeles Lakers 25
Chris Bosh PF Miami Heat 22.1927
Dwyane Wade SG Miami Heat 20
Carmelo Anthony SF New York Knicks 22.875
Kevin Durant SF Oklahoma City Thunder 20.1586

The same table can be created in many ways. Here is another, and no doubt you can think of more.

other_than_Cavs.where('SALARY', are.above_or_equal_to(20))
PLAYER POSITION TEAM SALARY
Joe Johnson SF Brooklyn Nets 24.8949
Derrick Rose PG Chicago Bulls 20.0931
Dwight Howard C Houston Rockets 22.3594
Chris Paul PG Los Angeles Clippers 21.4687
Kobe Bryant SF Los Angeles Lakers 25
Chris Bosh PF Miami Heat 22.1927
Dwyane Wade SG Miami Heat 20
Carmelo Anthony SF New York Knicks 22.875
Kevin Durant SF Oklahoma City Thunder 20.1586

As you can see, the use of where with are gives you great flexibility in accessing rows with features that interest you. Don’t hesitate to experiment!