NumPy is great, but it lacks a few things that are conducive to statistical analysis. By building on top of NumPy, pandas provides conveniences like groupby, rolling, and resample. This is the typical starting point for any intro to pandas. We'll follow suit.
Here we have the workhorse data structure for pandas: the DataFrame. It's an in-memory table holding your data, and it provides a few conveniences over lists of lists or NumPy arrays.
import numpy as np
import pandas as pd
# Many ways to construct a DataFrame
# We pass a dict of {column name: column values}
np.random.seed(42)
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [True, True, False],
                   'C': np.random.randn(3)},
                  index=['a', 'b', 'c'])  # also this weird index thing
df
Notice that we can store a column of integers, a column of booleans, and a column of floats in the same DataFrame.
Our first improvement over NumPy arrays is labeled indexing. We can select subsets by column, row, or both. Column selection uses the regular python __getitem__ machinery. Pass in a single column label 'A' or a list of labels ['A', 'C'] to select subsets of the original DataFrame.
# Single column, reduces to a Series
df['A']
cols = ['A', 'C']
df[cols]
For row-wise selection, use the special .loc accessor.
df.loc[['a', 'b']]
You can use ranges to select rows or columns.
df.loc['a':'b']
Notice that the slice is inclusive on both sides, unlike your typical slicing of a list. Sometimes, you'd rather slice by position instead of label. .iloc has you covered:
df.iloc[[0, 2]]
df.iloc[:2]
This follows the usual python slicing rules: closed on the left, open on the right.
As I mentioned, you can slice both rows and columns. Use .loc for label-based or .iloc for position-based indexing.
df.loc['a', 'B'], df.iloc[0, 1]
Pandas, like NumPy, will reduce dimensions when possible. Select a single column and you get back a Series (see below). Select a single row and a single column, and you get a scalar.
You can get pretty fancy:
df.loc['a':'b', ['A', 'C']]
To summarize:

- [] for selecting columns
- .loc[row_labels, column_labels] for label-based indexing
- .iloc[row_positions, column_positions] for positional indexing

I've left out boolean and hierarchical indexing, which we'll see later.
You've already seen some Series up above. It's the 1-dimensional analog of the DataFrame. Each column in a DataFrame is in some sense a Series. You can select a Series from a DataFrame in a few ways:
# __getitem__ like before
df['A']
# .loc, like before
df.loc[:, 'A']
# using `.` attribute lookup
df.A
# a column whose name collides with a DataFrame method
df['mean'] = ['a', 'b', 'c']
df['mean']
df.mean  # attribute lookup finds the .mean method, not the new column
You'll have to be careful with the last one. It won't work if your column name isn't a valid python identifier (say it has a space) or if it conflicts with one of the (many) methods on DataFrame. The . accessor is extremely convenient for interactive use though.
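For instance (a throwaway sketch; the 'first name' column here is made up purely to illustrate), a column whose name contains a space can only be reached with brackets:

tmp = pd.DataFrame({'first name': ['Alice', 'Bob']})  # hypothetical column with a space in its name
tmp['first name']   # brackets work fine
# tmp.first name    # SyntaxError: not a valid python identifier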
You should never assign a column with ., e.g. don't do
# bad
df.A = [1, 2, 3]
It's unclear whether you're attaching the list [1, 2, 3] as an attribute of df, or whether you want it as a column. It's better to just say
df['A'] = [1, 2, 3]
# or
df.loc[:, 'A'] = [1, 2, 3]
Series share many of the same methods as DataFrames.
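For instance (just a quick illustration with the df from above), the aggregation and inspection methods you'd call on a DataFrame also work on a single column:

df['C'].mean()      # a scalar, like df.mean() but for one column
df['C'].describe()  # summary statistics for a single column
df['C'].head(2)     # the first two values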
Indexes are something of a peculiarity to pandas. First off, they are not the kind of indexes you'll find in SQL, which are used to help the engine speed up certain queries. In pandas, Indexes are about labels. This helps with selection (like we did above) and automatic alignment when performing operations between two DataFrames or Series.
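Here's a small sketch of that alignment; the two Series below are made up for illustration. Labels are matched up before the operation, and labels present in only one operand come back as NaN.

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
s1 + s2  # aligned on labels: 'a' and 'd' have no match, so they're NaN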
R does have row labels, but they're nowhere near as powerful (or complicated) as in pandas. You can access the index of a DataFrame or Series with the .index attribute.
df.index
df.columns
np.random.seed(42)
df = pd.DataFrame(np.random.uniform(0, 100, size=(3, 3)))
# df = pd.DataFrame(np.random.randn(3, 3))
# df = pd.DataFrame(np.random.random([3, 3]))
df
df + 1
df ** 2
np.log(df)
DataFrames and Series have a bunch of useful aggregation methods: .mean, .max, .std, etc.
df.mean()
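A couple more, just to show the pattern (same df as above). By default these aggregate down each column; pass axis=1 to aggregate across each row instead.

df.std()         # standard deviation of each column
df.max()         # maximum of each column
df.mean(axis=1)  # mean across each row instead of down each column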
df = pd.read_csv('beer_subset.csv.gz', parse_dates=['time'], compression='gzip')
review_cols = ['review_appearance', 'review_aroma', 'review_overall',
               'review_palate', 'review_taste']
df.head()
Boolean indexing works like a where clause in SQL. The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.
df.abv < 5
df[df.abv < 5].head()
Notice that we just used [] there. We can pass the boolean indexer in to .loc as well.
df.loc[df.abv < 5, ['beer_style', 'review_overall']].head()
Again, you can get complicated:
df[((df.abv < 5) & (df.time > pd.Timestamp('2009-06'))) | (df.review_overall >= 4.5)]
Select just the rows where the beer_style contains 'American'.
Hint: Series containing strings have a bunch of useful methods under the DataFrame.<column>.str namespace. Typically they correspond to regular python string methods. We can't use 'American' in df['beer_style'], since in is used to check membership in the Series itself (its index), not in the strings. But in uses __contains__, so look for a string method like that.
df.beer_style.str.contains("American")
# Your solution
is_american = df.beer_style.str.contains("American")
df[is_american]
Groupby is a fundamental operation to pandas and data analysis.

The components of a groupby operation are to

1. split a table into groups,
2. apply a function to each group, and
3. combine the results.
In pandas the first step looks like
df.groupby( grouper )
grouper can be many things, for example a Series (or a string naming a column in df, like 'beer_style' below), or levels=[ names of levels in a MultiIndex ].
gr = df.groupby('beer_style')
gr
Haven't really done anything yet. Just some book-keeping to figure out which keys go with which rows. Keys are the things we've grouped by (each beer_style in this case).
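If you're curious, the GroupBy object exposes that book-keeping (this is just a peek; nothing is computed yet):

gr.ngroups           # how many distinct beer_styles we found
list(gr.groups)[:5]  # the first few group keys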
The last two steps, apply and combine, are just:
gr.agg('mean')
df.groupby('beer_style').mean()
This says apply the mean function to each column. Non-numeric columns (nuisance columns) are excluded. We can also select a subset of columns to perform the aggregation on.
gr[review_cols].agg('mean')
Attribute lookup with . works as well.
gr.abv.agg('mean')
Certain operations are attached directly to the GroupBy object, letting you bypass the .agg part.
gr.abv.mean()
Now we'll run the gamut on a bunch of grouper / apply combinations. Keep sight of the target though: split, apply, combine.
The output shape is determined by the grouper, the columns being aggregated (the groupee), and the aggregation:

- Grouper: controls the output index
  - single grouper -> Index
  - array-like grouper -> MultiIndex
- Groupee: controls the output values
  - single column -> Series (or DataFrame if multiple aggregations)
  - multiple columns -> DataFrame
- Aggregation: controls the output columns
  - single aggregation -> Index in the columns
  - multiple aggregations -> MultiIndex in the columns (or a 1-D Index if the groupee is 1-D)

Multiple Aggregations on one column
gr['review_aroma'].agg(['mean', 'std', 'count']).head()
Single Aggregation on multiple columns
gr[review_cols].mean()
Multiple aggregations on multiple columns
gr[review_cols].agg(['mean', 'count', 'std'])
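To round out the shape rules above, here's a hedged sketch of an array-like grouper: grouping by beer_style together with a boolean derived from abv gives a MultiIndex in the rows.

# two groupers -> MultiIndex in the rows
df.groupby(['beer_style', df.abv < 5])[review_cols].mean().head()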