NumPy is great, but it lacks a few things that are conducive to statistical analysis. By building on top of NumPy, pandas provides conveniences like groupby, rolling, and resample. This is the typical starting point for any intro to pandas. We'll follow suit.
Here we have the workhorse data structure for pandas: the DataFrame. It's an in-memory table holding your data, and it provides a few conveniences over lists of lists or NumPy arrays.
import numpy as np
import pandas as pd
# Many ways to construct a DataFrame
# We pass a dict of {column name: column values}
np.random.seed(42)
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [True, True, False],
                   'C': np.random.randn(3)},
                  index=['a', 'b', 'c'])  # also this weird index thing
df
Notice that we can store a column of integers, a column of booleans, and a column of floats in the same DataFrame.
Our first improvement over NumPy arrays is labeled indexing. We can select subsets by column, row, or both. Column selection uses the regular python __getitem__ machinery. Pass in a single column label 'A' or a list of labels ['A', 'C'] to select subsets of the original DataFrame.
# Single column, reduces to a Series
df['A']
cols = ['A', 'C']
df[cols]
For row-wise selection, use the special .loc accessor.
df.loc[['a', 'b']]
You can use ranges to select rows or columns.
df.loc['a':'b']
Notice that the slice is inclusive on both sides, unlike your typical slicing of a list. Sometimes, you'd rather slice by position instead of label. .iloc has you covered:
df.iloc[[0, 2]]
df.iloc[:2]
This follows the usual python slicing rules: closed on the left, open on the right.
As I mentioned, you can slice both rows and columns. Use .loc for label-based or .iloc for position-based indexing.
df.loc['a', 'B'], df.iloc[0, 1]
Pandas, like NumPy, will reduce dimensions when possible. Select a single column and you get back a Series (see below). Select a single row and a single column, and you get a scalar.
You can get pretty fancy:
df.loc['a':'b', ['A', 'C']]
To summarize:

- [] for selecting columns
- .loc[row_labels, column_labels] for label-based indexing
- .iloc[row_positions, column_positions] for positional indexing

I've left out boolean and hierarchical indexing, which we'll see later.
You've already seen some Series up above. It's the 1-dimensional analog of the DataFrame. Each column in a DataFrame is in some sense a Series. You can select a Series from a DataFrame in a few ways:
# __getitem__ like before
df['A']
# .loc, like before
df.loc[:, 'A']
# using `.` attribute lookup
df.A
# a column whose name collides with a DataFrame method
df['mean'] = ['a', 'b', 'c']
df['mean']
df.mean  # attribute lookup finds the .mean method, not the new column
You'll have to be careful with the last one. It won't work if your column name isn't a valid python identifier (say it has a space) or if it conflicts with one of the (many) methods on DataFrame. The . accessor is extremely convenient for interactive use though.
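For instance (a throwaway sketch; the 'first name' column here is made up purely to illustrate), a column whose name contains a space can only be reached with brackets:

tmp = pd.DataFrame({'first name': ['Alice', 'Bob']})  # hypothetical column with a space in its name
tmp['first name']   # brackets work fine
# tmp.first name    # SyntaxError: not a valid python identifier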
You should never assign a column with ., e.g. don't do
# bad
df.A = [1, 2, 3]
It's unclear whether you're attaching the list [1, 2, 3] as an attribute of df, or whether you want it as a column. It's better to just say
df['A'] = [1, 2, 3]
# or
df.loc[:, 'A'] = [1, 2, 3]
Series share many of the same methods as DataFrames.
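For instance (just a quick illustration with the df from above), the aggregation and inspection methods you'd call on a DataFrame also work on a single column:

df['C'].mean()      # a scalar, like df.mean() but for one column
df['C'].describe()  # summary statistics for a single column
df['C'].head(2)     # the first two values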
Indexes are something of a peculiarity to pandas. First off, they are not the kind of indexes you'll find in SQL, which are used to help the engine speed up certain queries. In pandas, Indexes are about labels. This helps with selection (like we did above) and automatic alignment when performing operations between two DataFrames or Series.
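Here's a small sketch of that alignment; the two Series below are made up for illustration. Labels are matched up before the operation, and labels present in only one operand come back as NaN.

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
s1 + s2  # aligned on labels: 'a' and 'd' have no match, so they're NaN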
R does have row labels, but they're nowhere near as powerful (or complicated) as in pandas. You can access the index of a DataFrame or Series with the .index attribute.
df.index
df.columns
np.random.seed(42)
df = pd.DataFrame(np.random.uniform(0, 100, size=(3, 3)))
# df = pd.DataFrame(np.random.randn(3, 3))
# df = pd.DataFrame(np.random.random([3, 3]))
df
df + 1
df ** 2
np.log(df)
DataFrames and Series have a bunch of useful aggregation methods: .mean, .max, .std, etc.
df.mean()
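A couple more, just to show the pattern (same df as above). By default these aggregate down each column; pass axis=1 to aggregate across each row instead.

df.std()         # standard deviation of each column
df.max()         # maximum of each column
df.mean(axis=1)  # mean across each row instead of down each column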
df = pd.read_csv('beer_subset.csv.gz', parse_dates=['time'], compression='gzip')
review_cols = ['review_appearance', 'review_aroma', 'review_overall',
               'review_palate', 'review_taste']
df.head()
Boolean indexing works like a where clause in SQL. The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.
df.abv < 5
df[df.abv < 5].head()
Notice that we just used [] there. We can pass the boolean indexer in to .loc as well.
df.loc[df.abv < 5, ['beer_style', 'review_overall']].head()
Again, you can get complicated:
df[((df.abv < 5) & (df.time > pd.Timestamp('2009-06'))) | (df.review_overall >= 4.5)]
Select just the rows where the beer_style contains 'American'.
Hint: Series containing strings have a bunch of useful methods under the DataFrame.<column>.str namespace. Typically they correspond to regular python string methods. We can't use 'American' in df['beer_style'], since in is used to check membership in the Series itself (its index), not in the strings. But in uses __contains__, so look for a string method like that.
df.beer_style.str.contains("American")
# Your solution
is_american = df.beer_style.str.contains("American")
df[is_american]
Groupby is a fundamental operation to pandas and data analysis.

The components of a groupby operation are to

1. split a table into groups,
2. apply a function to each group, and
3. combine the results.
In pandas the first step looks like
df.groupby( grouper )
grouper can be many things, for example a Series (or a string naming a column in df, like 'beer_style' below), or levels=[ names of levels in a MultiIndex ].
gr = df.groupby('beer_style')
gr
Haven't really done anything yet. Just some book-keeping to figure out which keys go with which rows. Keys are the things we've grouped by (each beer_style in this case).
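If you're curious, the GroupBy object exposes that book-keeping (this is just a peek; nothing is computed yet):

gr.ngroups           # how many distinct beer_styles we found
list(gr.groups)[:5]  # the first few group keys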
The last two steps, apply and combine, are just:
gr.agg('mean')
df.groupby('beer_style').mean()
This says apply the mean function to each column. Non-numeric columns (nuisance columns) are excluded. We can also select a subset of columns to perform the aggregation on.
gr[review_cols].agg('mean')
Attribute lookup with . works as well.
gr.abv.agg('mean')
Certain operations are attached directly to the GroupBy object, letting you bypass the .agg part.
gr.abv.mean()
Now we'll run the gamut on a bunch of grouper / apply combinations. Keep sight of the target though: split, apply, combine.
The output shape is determined by the grouper, the columns being aggregated (the groupee), and the aggregation:

- Grouper: controls the output index
  - single grouper -> Index
  - array-like grouper -> MultiIndex
- Groupee: controls the output values
  - single column -> Series (or DataFrame if multiple aggregations)
  - multiple columns -> DataFrame
- Aggregation: controls the output columns
  - single aggregation -> Index in the columns
  - multiple aggregations -> MultiIndex in the columns (or a 1-D Index if the groupee is 1-D)

Multiple Aggregations on one column
gr['review_aroma'].agg(['mean', 'std', 'count']).head()
Single Aggregation on multiple columns
gr[review_cols].mean()
Multiple aggregations on multiple columns
gr[review_cols].agg(['mean', 'count', 'std'])
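To round out the shape rules above, here's a hedged sketch of an array-like grouper: grouping by beer_style together with a boolean derived from abv gives a MultiIndex in the rows.

# two groupers -> MultiIndex in the rows
df.groupby(['beer_style', df.abv < 5])[review_cols].mean().head()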