NumPy is great, but it lacks a few things that are conducive to doing statistical analysis. By building on top of NumPy, pandas provides labeled indexing and convenient methods for operations like groupby, rolling, and resample. This is the typical starting point for any intro to pandas. We'll follow suit.
Here we have the workhorse data structure for pandas. It's an in-memory table holding your data, and provides a few conveniences over lists of lists or NumPy arrays.
import numpy as np
import pandas as pd
# Many ways to construct a DataFrame
# We pass a dict of {column name: column values}
np.random.seed(42)
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [True, True, False],
                   'C': np.random.randn(3)},
                  index=['a', 'b', 'c'])  # also this weird index thing
df
Notice that we can store a column of integers, a column of booleans, and a column of floats in the same DataFrame.
Our first improvement over numpy arrays is labeled indexing. We can select subsets by column, row, or both. Column selection uses the regular python __getitem__ machinery. Pass in a single column label 'A' or a list of labels ['A', 'C'] to select subsets of the original DataFrame.
# Single column, reduces to a Series
df['A']
cols = ['A', 'C']
df[cols]
For row-wise selection, use the special .loc accessor.
df.loc[['a', 'b']]
You can use ranges to select rows or columns.
df.loc['a':'b']
Notice that the slice is inclusive on both sides, unlike your typical slicing of a list. Sometimes, you'd rather slice by position instead of label. .iloc has you covered:
df.iloc[[0, 2]]
df.iloc[:2]
This follows the usual python slicing rules: closed on the left, open on the right.
As I mentioned, you can slice both rows and columns. Use .loc for label or .iloc for position indexing.
df.loc['a', 'B'], df.iloc[0, 1]
Pandas, like NumPy, will reduce dimensions when possible. Select a single column and you get back Series (see below). Select a single row and single column, you get a scalar.
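A quick sketch of that dimension reduction, reusing the small DataFrame from above: a list of labels keeps a DataFrame, a single label drops to a Series, and a single row plus column drops to a scalar.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [True, True, False],
                   'C': np.random.randn(3)},
                  index=['a', 'b', 'c'])

print(type(df[['A']]))         # list of labels -> DataFrame
print(type(df['A']))           # single label -> Series
print(type(df.loc['a', 'A']))  # single row and column -> scalar
```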
You can get pretty fancy:
df.loc['a':'b', ['A', 'C']]
To summarize:
[] for selecting columns
.loc[row_labels, column_labels] for label-based indexing
.iloc[row_positions, column_positions] for positional indexing
I've left out boolean and hierarchical indexing, which we'll see later.
You've already seen some Series up above. It's the 1-dimensional analog of the DataFrame. Each column in a DataFrame is in some sense a Series. You can select a Series from a DataFrame in a few ways:
# __getitem__ like before
df['A']
# .loc, like before
df.loc[:, 'A']
# using `.` attribute lookup
df.A
df['mean'] = ['a', 'b', 'c']
df['mean']
df.mean
You'll have to be careful with the last one. It won't work if your column name isn't a valid python identifier (say it has a space) or if it conflicts with one of the (many) methods on DataFrame. The . accessor is extremely convenient for interactive use though.
You should never assign a column with the . accessor, e.g. don't do
# bad
df.A = [1, 2, 3]
It's unclear whether you're attaching the list [1, 2, 3] as an attribute of df, or whether you want it as a column. It's better to just say
df['A'] = [1, 2, 3]
# or
df.loc[:, 'A'] = [1, 2, 3]
Series share many of the same methods as DataFrames.
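For instance, the same label-based and positional accessors, and the same aggregations, work on a standalone Series. A small sketch:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

print(s.loc['a':'b'])  # label slice, inclusive on both ends
print(s.iloc[:2])      # positional slice, closed on the left, open on the right
print(s.mean())        # aggregations work too
```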
Indexes are something of a peculiarity to pandas.
First off, they are not the kind of indexes you'll find in SQL, which are used to help the engine speed up certain queries.
In pandas, Indexes are about labels. This helps with selection (like we did above) and automatic alignment when performing operations between two DataFrames or Series.
R does have row labels, but they're nowhere near as powerful (or complicated) as in pandas. You can access the index of a DataFrame or Series with the .index attribute.
df.index
df.columns
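That automatic alignment is worth seeing once. In this sketch with two toy Series, arithmetic matches values by label first; labels present in only one operand come back as NaN rather than silently pairing values by position.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
b = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are paired by index label, not by position.
# 'a' and 'd' appear in only one operand, so they become NaN.
print(a + b)
```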
np.random.seed(42)
df = pd.DataFrame(np.random.uniform(0, 100, size=(3, 3)))
# df = pd.DataFrame(np.random.randn(3, 3))
# df = pd.DataFrame(np.random.random([3, 3]))
df
df + 1
df ** 2
np.log(df)
DataFrames and Series have a bunch of useful aggregation methods, .mean, .max, .std, etc.
df.mean()
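These aggregations reduce along an axis: column-wise by default, or row-wise with axis=1. A quick sketch on the same random frame:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.uniform(0, 100, size=(3, 3)))

print(df.mean())        # one value per column (the default, axis=0)
print(df.mean(axis=1))  # one value per row
print(df.max())
print(df.std())
```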
df = pd.read_csv('beer_subset.csv.gz', parse_dates=['time'], compression='gzip')
review_cols = ['review_appearance', 'review_aroma', 'review_overall',
'review_palate', 'review_taste']
df.head()
Like a where clause in SQL. The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.
df.abv < 5
df[df.abv < 5].head()
Notice that we just used [] there. We can pass the boolean indexer in to .loc as well.
df.loc[df.abv < 5, ['beer_style', 'review_overall']].head()
Again, you can get complicated
df[((df.abv < 5) & (df.time > pd.Timestamp('2009-06'))) | (df.review_overall >= 4.5)]
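A note on the syntax there: boolean masks are combined with the bitwise operators & (and), | (or), and ~ (not), and each comparison must be parenthesized, because & and | bind tighter than < or >. A minimal sketch on a hypothetical toy frame (standing in for the beer data):

```python
import pandas as pd

toy = pd.DataFrame({'abv': [4.5, 6.0, 9.2],
                    'review_overall': [4.0, 4.5, 3.0]})

# Without the parentheses, toy.abv < 5 | toy.review_overall would be
# evaluated as toy.abv < (5 | toy.review_overall) and fail or mislead.
mask = (toy.abv < 5) | (toy.review_overall >= 4.5)
print(toy[mask])
```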
Select just the rows where the beer_style contains 'American'.
Hint: Series containing strings have a bunch of useful methods under the DataFrame.<column>.str namespace. Typically they correspond to regular python string methods, but are applied elementwise to each string in the Series.
We can't use 'American' in df['beer_style'], since in checks membership in the series itself (its index), not in the strings. But in uses __contains__, so look for a string method like that.
df.beer_style.str.contains("American")
# Your solution
is_american = df.beer_style.str.contains("American")
df[is_american]
Groupby is a fundamental operation to pandas and data analysis.
The components of a groupby operation are to split a table into groups, apply a function to each group, and combine the results.
In pandas the first step looks like
df.groupby( grouper )
grouper can be many things:
a column name (or list of column names)
a function (applied to the index of df)
levels=[ names of levels in a MultiIndex ]

gr = df.groupby('beer_style')
gr
Haven't really done anything yet. Just some book-keeping to figure out which keys go with which rows. Keys are the things we've grouped by (each beer_style in this case).
The last two steps, apply and combine, are just:
gr.agg('mean')
df.groupby('beer_style').mean()
This says apply the mean function to each column. Non-numeric columns (nuisance columns) are excluded. We can also select a subset of columns to perform the aggregation on.
gr[review_cols].agg('mean')
. attribute lookup works as well.
gr.abv.agg('mean')
Certain operations are attached directly to the GroupBy object, letting you bypass the .agg part
gr.abv.mean()
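Before moving on, the three steps can be written out by hand on a small toy table (not the beer data) to see what groupby is doing for us in one call:

```python
import pandas as pd

toy = pd.DataFrame({'style': ['ipa', 'stout', 'ipa', 'stout'],
                    'score': [4.0, 3.5, 4.5, 3.0]})

# 1. split: partition the rows by key
groups = {key: grp for key, grp in toy.groupby('style')}
# 2. apply: run a function on each piece
means = {key: grp['score'].mean() for key, grp in groups.items()}
# 3. combine: glue the per-group results back into one object
by_hand = pd.Series(means)

# groupby does all three steps at once
print(toy.groupby('style')['score'].mean())
```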
Now we'll run the gamut on a bunch of grouper / apply combinations. Keep sight of the target though: split, apply, combine.
The shape of the result depends on the grouper and the aggregation: the row labels are an Index for a single grouping key, or a MultiIndex for multiple keys; you get back a Series (or a DataFrame if there are multiple aggregations); and the columns are an Index, or a MultiIndex when there are multiple aggregations per column (or a 1-D Index if the groupee is 1-d).

Multiple aggregations on one column
gr['review_aroma'].agg(['mean', 'std', 'count']).head()
Single Aggregation on multiple columns
gr[review_cols].mean()
Multiple aggregations on multiple columns
gr[review_cols].agg(['mean', 'count', 'std'])