Data Analysis

Adapted from IBM Cognitive Class Series

Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

pd.set_option('display.max_columns', 100)
In [2]:
filename = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv"

Create a Python list containing the column header names.

In [3]:
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

Use the Pandas method read_csv() to load the data from the web address. Set the parameter "names" equal to the Python list "headers".

In [4]:
df = pd.read_csv(filename, names = headers)

Use the method head() to display the first 10 rows of the dataframe.

In [5]:
df.head(n=10)
Out[5]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length width height curb-weight engine-type num-of-cylinders engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
5 2 ? audi gas std two sedan fwd front 99.8 177.3 66.3 53.1 2507 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250
6 1 158 audi gas std four sedan fwd front 105.8 192.7 71.4 55.7 2844 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 17710
7 1 ? audi gas std four wagon fwd front 105.8 192.7 71.4 55.7 2954 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 18920
8 1 158 audi gas turbo four sedan fwd front 105.8 192.7 71.4 55.9 3086 ohc five 131 mpfi 3.13 3.40 8.3 140 5500 17 20 23875
9 0 ? audi gas turbo two hatchback 4wd front 99.5 178.2 67.9 52.0 3053 ohc five 131 mpfi 3.13 3.40 7.0 160 5500 16 22 ?

Use the info() method to see basic information about the dataset.

In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  205 non-null    object 
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       205 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non-null    int64  
 17  fuel-system        205 non-null    object 
 18  bore               205 non-null    object 
 19  stroke             205 non-null    object 
 20  compression-ratio  205 non-null    float64
 21  horsepower         205 non-null    object 
 22  peak-rpm           205 non-null    object 
 23  city-mpg           205 non-null    int64  
 24  highway-mpg        205 non-null    int64  
 25  price              205 non-null    object 
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB

Data Wrangling

Data Wrangling is the process of converting data from the initial format to a format that may be better for analysis.

As we can see, several question marks appeared in the dataframe; those are missing values which may hinder our further analysis. So, how do we identify all those missing values and deal with them?

Steps for working with missing data:

  1. Identify missing data
  2. Deal with missing data
  3. Correct data format

Identifying missing values

Convert "?" to NaN

In the car dataset, missing entries appear as the question mark "?". We replace "?" with NaN (Not a Number), Python's default missing value marker, for reasons of computational speed and convenience. Here we use the function:

.replace(A, B, inplace = True) 
to replace A by B

In [7]:
# replace "?" with NaN
df.replace("?", np.nan, inplace = True)
df.head(5)
Out[7]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length width height curb-weight engine-type num-of-cylinders engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 NaN alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 NaN alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 NaN alfa-romero gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

Count missing values in each column

The missing values have now been converted to NaN, pandas' default missing value marker. We can use built-in pandas methods to identify them. There are two methods to detect missing data:

  1. .isnull()
  2. .notnull()
Each method returns a dataframe of boolean values indicating whether each entry is missing.

In [8]:
df.isnull().head(5)
Out[8]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length width height curb-weight engine-type num-of-cylinders engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 False True False False False False False False False False False False False False False False False False False False False False False False False False
1 False True False False False False False False False False False False False False False False False False False False False False False False False False
2 False True False False False False False False False False False False False False False False False False False False False False False False False False
3 False False False False False False False False False False False False False False False False False False False False False False False False False False
4 False False False False False False False False False False False False False False False False False False False False False False False False False False

"True" stands for missing value, while "False" stands for not missing value.

In [9]:
missing_count = df.isnull().sum()
missing_count[missing_count > 0]
Out[9]:
normalized-losses    41
num-of-doors          2
bore                  4
stroke                4
horsepower            2
peak-rpm              2
price                 4
dtype: int64

Based on the summary above, each column contains 205 rows of data, and seven columns contain missing values:

  1. "normalized-losses": 41 missing data
  2. "num-of-doors": 2 missing data
  3. "bore": 4 missing data
  4. "stroke" : 4 missing data
  5. "horsepower": 2 missing data
  6. "peak-rpm": 2 missing data
  7. "price": 4 missing data

Dealing with missing data

How to deal with missing data?

  1. drop data
    a. drop the whole row
    b. drop the whole column
  2. replace data
    a. replace it by mean
    b. replace it by frequency
    c. replace it based on other functions

Whole columns should be dropped only if most of their entries are empty. In our dataset, no column is anywhere near that empty (as the percentages above show). We have some freedom in choosing how to replace missing data; however, some methods are more reasonable than others. We will handle the affected columns as follows:

Replace by mean:

  • "normalized-losses": 41 missing data, replace them with mean
  • "stroke": 4 missing data, replace them with mean
  • "bore": 4 missing data, replace them with mean
  • "horsepower": 2 missing data, replace them with mean
  • "peak-rpm": 2 missing data, replace them with mean

Replace by frequency:

  • "num-of-doors": 2 missing data, replace them with "four".
    • Reason: 84% of sedans have four doors. Since four doors is the most frequent value, it is the most likely to occur

Drop the whole row:

  • "price": 4 missing data, simply delete the whole row
    • Reason: price is what we want to predict. Any data entry without price data cannot be used for prediction; therefore any row without price data is not useful to us
In [10]:
# Calculate the mean of the "normalized-losses" column
# df["normalized-losses"].mean()
# Note: this is left commented out because the column is still of type
# "object" at this point; we correct the data types first (see below).

Correct data format

In Pandas, we use

.dtypes to check the data types

.astype() to change the data type

Let's list the data types and the number of unique values for each column.

In [11]:
df_dtype_nunique = pd.concat([df.dtypes, df.nunique()],axis=1)
df_dtype_nunique.columns = ["dtype","unique"]
df_dtype_nunique
Out[11]:
dtype unique
symboling int64 6
normalized-losses object 51
make object 22
fuel-type object 2
aspiration object 2
num-of-doors object 2
body-style object 5
drive-wheels object 3
engine-location object 2
wheel-base float64 53
length float64 75
width float64 44
height float64 49
curb-weight int64 171
engine-type object 7
num-of-cylinders object 7
engine-size int64 44
fuel-system object 8
bore object 38
stroke object 36
compression-ratio float64 32
horsepower object 59
peak-rpm object 23
city-mpg int64 29
highway-mpg int64 30
price object 186

As we can see above, some columns are not of the correct data type. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'. For example, 'bore' and 'stroke' variables are numerical values that describe the engines, so we should expect them to be of the type 'float' or 'int'; however, they are shown as type 'object'. We have to convert data types into a proper format for each column using the "astype()" method.
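As a side note, "astype()" raises an error if a column contains strings it cannot parse. A more forgiving alternative (a sketch, not used below) is "pd.to_numeric()" with errors="coerce", which turns unparseable entries into NaN:

In [ ]:
# Coerce to numeric; unparseable strings would become NaN instead of raising
pd.to_numeric(df["bore"], errors="coerce").head()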

In [12]:
df.head()
Out[12]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length width height curb-weight engine-type num-of-cylinders engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 NaN alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 NaN alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 NaN alfa-romero gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

Convert data types to proper format

In [13]:
numerical_features = ["normalized-losses","stroke","bore","horsepower","peak-rpm","price"]
df[numerical_features] = df[numerical_features].astype("float")

Let us list the columns after the conversion

In [14]:
df.dtypes
Out[14]:
symboling              int64
normalized-losses    float64
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
dtype: object

Dropping rows with "NaN"

Let's drop all rows that do not have price data:

In [15]:
# simply drop whole row with NaN in "price" column
df.dropna(subset=["price"], axis=0, inplace=True)

# reset index, because we dropped four rows
df.reset_index(drop=True, inplace=True)

Calculate the average of the "normalized-losses" column

In [16]:
avg_norm_loss = df["normalized-losses"].mean()
print("Average of normalized-losses:", avg_norm_loss)
Average of normalized-losses: 122.0

Replace "NaN" by mean value in "normalized-losses" column

In [17]:
df["normalized-losses"].fillna(value=avg_norm_loss, inplace=True)
# OR
# df["normalized-losses"].replace(np.nan, avg_norm_loss, inplace=True)

Replacing "NaN" by mean value for all numeric features in one go

In [18]:
df.fillna(value=df.mean(numeric_only=True), inplace=True)

Replacing "NaN" with mode (most frequent) for categorical features

To see which values are present in a particular column, we can use the ".value_counts()" method:

In [19]:
df['num-of-doors'].value_counts()
Out[19]:
four    113
two      86
Name: num-of-doors, dtype: int64

We can see that four doors is the most common type. We can also use the ".idxmax()" method to find the most common value automatically:

In [20]:
df['num-of-doors'].value_counts().idxmax()
Out[20]:
'four'

The replacement procedure is very similar to what we have seen previously

In [21]:
# replace the missing 'num-of-doors' values with the most frequent value
df["num-of-doors"] = df["num-of-doors"].replace(np.nan, "four")
In [22]:
df.head()
Out[22]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base length width height curb-weight engine-type num-of-cylinders engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 122.0 alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111.0 5000.0 21 27 13495.0
1 3 122.0 alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111.0 5000.0 21 27 16500.0
2 1 122.0 alfa-romero gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.47 9.0 154.0 5000.0 19 26 16500.0
3 2 164.0 audi gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10.0 102.0 5500.0 24 30 13950.0
4 2 164.0 audi gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.40 8.0 115.0 5500.0 18 22 17450.0
In [23]:
df.isnull().any().any()
Out[23]:
False

Good! We now have a dataset with no missing values.


Data Analysis

Descriptive Statistical Analysis

Let's first take a look at the variables by utilizing a description method.

The describe function automatically computes basic statistics for all continuous variables. Any NaN values are automatically skipped in these statistics.

This will show:

  • the count of that variable
  • the mean
  • the standard deviation (std)
  • the minimum value
  • the quartiles (25%, 50% and 75%)
  • the maximum value

We can apply the method "describe" as follows:

In [24]:
df.describe()
Out[24]:
symboling normalized-losses wheel-base length width height curb-weight engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
count 201.000000 201.00000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000
mean 0.840796 122.00000 98.797015 174.200995 65.889055 53.766667 2555.666667 126.875622 3.330711 3.256904 10.164279 103.396985 5117.587940 25.179104 30.686567 13207.129353
std 1.254802 31.99625 6.066366 12.322175 2.101471 2.447822 517.296727 41.546834 0.268072 0.316048 4.004965 37.365602 478.113178 6.423220 6.815150 7947.066342
min -2.000000 65.00000 86.600000 141.100000 60.300000 47.800000 1488.000000 61.000000 2.540000 2.070000 7.000000 48.000000 4150.000000 13.000000 16.000000 5118.000000
25% 0.000000 101.00000 94.500000 166.800000 64.100000 52.000000 2169.000000 98.000000 3.150000 3.110000 8.600000 70.000000 4800.000000 19.000000 25.000000 7775.000000
50% 1.000000 122.00000 97.000000 173.200000 65.500000 54.100000 2414.000000 120.000000 3.310000 3.290000 9.000000 95.000000 5117.587940 24.000000 30.000000 10295.000000
75% 2.000000 137.00000 102.400000 183.500000 66.600000 55.500000 2926.000000 141.000000 3.580000 3.410000 9.400000 116.000000 5500.000000 30.000000 34.000000 16500.000000
max 3.000000 256.00000 120.900000 208.100000 72.000000 59.800000 4066.000000 326.000000 3.940000 4.170000 23.000000 262.000000 6600.000000 49.000000 54.000000 45400.000000

The default setting of "describe" skips variables of type object. We can apply the method "describe" on the variables of type 'object' as follows:

In [25]:
df.describe(include='object')
Out[25]:
make fuel-type aspiration num-of-doors body-style drive-wheels engine-location engine-type num-of-cylinders fuel-system
count 201 201 201 201 201 201 201 201 201 201
unique 22 2 2 2 5 3 2 6 7 8
top toyota gas std four sedan fwd front ohc four mpfi
freq 32 181 165 115 94 118 198 145 157 92

Grouping

The "groupby" method groups data by different categories. The data is grouped based on one or several variables and analysis is performed on the individual groups.

For example, let's group by the variable "drive-wheels". We see that there are 3 different categories of drive wheels.

In [26]:
df['drive-wheels'].unique()
Out[26]:
array(['rwd', 'fwd', '4wd'], dtype=object)

If we want to know, on average, which type of drive wheel is most valuable, we can group by "drive-wheels" and then average the price within each group.

We can select the columns 'drive-wheels', 'body-style' and 'price', then assign the result to the variable "df_group".

In [27]:
df_group = df[['drive-wheels','body-style','price']]

We can then calculate the average price for each of the different categories of data.

In [28]:
# Use groupby to calculate average price for each category of drive-wheels
grouped_test1 = df_group.groupby(['drive-wheels'], as_index=False).mean(numeric_only=True)
grouped_test1
Out[28]:
drive-wheels price
0 4wd 10241.000000
1 fwd 9244.779661
2 rwd 19757.613333

From our data, it seems rear-wheel drive vehicles are, on average, the most expensive, while 4-wheel and front-wheel are approximately the same in price.

You can also group by multiple variables. For example, let's group by both 'drive-wheels' and 'body-style'. This groups the dataframe by the unique combinations of 'drive-wheels' and 'body-style'. We can store the results in the variable 'grouped_test2'.

In [29]:
# Use groupby to calculate average price for each unique combination of category of drive-wheels
grouped_test2 = df_group.groupby(['drive-wheels','body-style'], as_index=False).mean(numeric_only=True)
grouped_test2
Out[29]:
drive-wheels body-style price
0 4wd hatchback 7603.000000
1 4wd sedan 12647.333333
2 4wd wagon 9095.750000
3 fwd convertible 11595.000000
4 fwd hardtop 8249.000000
5 fwd hatchback 8396.387755
6 fwd sedan 9811.800000
7 fwd wagon 9997.333333
8 rwd convertible 23949.600000
9 rwd hardtop 24202.714286
10 rwd hatchback 14337.777778
11 rwd sedan 21711.833333
12 rwd wagon 16994.222222

This grouped data is much easier to visualize when it is made into a pivot table. A pivot table is like an Excel spreadsheet, with one variable along the columns and another along the rows. We can convert the grouped dataframe into a pivot table using the method "pivot".

In this case, we will leave the drive-wheel variable as the rows of the table, and pivot body-style to become the columns of the table:

In [30]:
grouped_pivot = grouped_test2.pivot(index='drive-wheels',columns='body-style')
grouped_pivot
Out[30]:
price
body-style convertible hardtop hatchback sedan wagon
drive-wheels
4wd NaN NaN 7603.000000 12647.333333 9095.750000
fwd 11595.0 8249.000000 8396.387755 9811.800000 9997.333333
rwd 23949.6 24202.714286 14337.777778 21711.833333 16994.222222

Often, we won't have data for some of the pivot cells. We can fill these missing cells with the value 0, but any other value could potentially be used as well. It should be mentioned that missing data is quite a complex subject and is an entire course on its own.

In [31]:
grouped_pivot = grouped_pivot.fillna(0) #fill missing values with 0
grouped_pivot
Out[31]:
price
body-style convertible hardtop hatchback sedan wagon
drive-wheels
4wd 0.0 0.000000 7603.000000 12647.333333 9095.750000
fwd 11595.0 8249.000000 8396.387755 9811.800000 9997.333333
rwd 23949.6 24202.714286 14337.777778 21711.833333 16994.222222

Data Visualization

When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.

In [32]:
# List the data types for each column
print(df.dtypes)
symboling              int64
normalized-losses    float64
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
dtype: object

Continuous numerical variables:

Continuous numerical variables are variables that may contain any value within some range. Continuous numerical variables can have the type "int64" or "float64". A great way to visualize these variables is by using scatterplots with fitted lines.

In order to start understanding the (linear) relationship between an individual variable and the price, we can use "regplot", which plots the scatterplot plus the fitted regression line for the data.

Let's see several examples of different linear relationships:

Positive linear relationship

Let's find the scatterplot of "engine-size" and "price"

In [33]:
# Engine size as potential predictor variable of price
sns.regplot(x="engine-size", y="price", data=df)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5318983590>

As the engine-size goes up, the price goes up: this indicates a positive correlation between these two variables. Engine size seems like a pretty good predictor of price since the regression line is almost a perfect diagonal line.

We can examine the correlation between 'engine-size' and 'price' and see it's approximately 0.87

In [34]:
df[["engine-size", "price"]].corr()
Out[34]:
engine-size price
engine-size 1.000000 0.872335
price 0.872335 1.000000

Highway mpg is a potential predictor variable of price

In [35]:
sns.regplot(x="highway-mpg", y="price", data=df)
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f53168d2310>

As the highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship between these two variables. Highway mpg could potentially be a predictor of price.

We can examine the correlation between 'highway-mpg' and 'price' and see it's approximately -0.7

In [36]:
df[['highway-mpg', 'price']].corr()
Out[36]:
highway-mpg price
highway-mpg 1.000000 -0.704692
price -0.704692 1.000000

Weak Linear Relationship

Let's see if "Peak-rpm" as a predictor variable of "price".

In [37]:
sns.regplot(x="peak-rpm", y="price", data=df)
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f53168c6f50>

Peak rpm does not seem like a good predictor of the price at all since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore it is not a reliable predictor.

We can examine the correlation between 'peak-rpm' and 'price' and see it's approximately -0.1

In [38]:
df[['peak-rpm','price']].corr()
Out[38]:
peak-rpm price
peak-rpm 1.000000 -0.101542
price -0.101542 1.000000
In [39]:
# Find the correlation between x="stroke", y="price"
df[["stroke","price"]].corr()
Out[39]:
stroke price
stroke 1.000000 0.082267
price 0.082267 1.000000
In [40]:
# Given the correlation results between "price" and "stroke" do you expect a linear relationship?
# Verify your results using the function "regplot()".
sns.regplot(x="stroke", y="price", data=df)
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5316756e90>

Categorical variables

These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type "object" or "int64". A good way to visualize categorical variables is by using boxplots.

Let's look at the relationship between "body-style" and "price".

In [41]:
sns.boxplot(x="body-style", y="price", data=df)
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f531674edd0>

We see that the distributions of price between the different body-style categories overlap significantly, so body-style would not be a good predictor of price. Let's examine "engine-location" and "price":

In [42]:
sns.boxplot(x="engine-location", y="price", data=df)
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f53166200d0>

Here we see that the distributions of price between the two engine-location categories, front and rear, are distinct enough to consider engine-location a potentially good predictor of price.

Let's examine "drive-wheels" and "price".

In [43]:
# drive-wheels
sns.boxplot(x="drive-wheels", y="price", data=df)
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f53165a5990>

Here we see that the distribution of price between the different drive-wheels categories differs; as such drive-wheels could potentially be a predictor of price.

Correlation

Correlation: a measure of the extent of interdependence between variables.

Causation: the relationship between cause and effect between two variables.

It is important to know the difference between these two, and that correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.
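For a single pair of variables, we can also quantify both the strength and the statistical significance of a correlation with SciPy (a sketch; assumes scipy is available):

In [ ]:
from scipy import stats

# Pearson correlation coefficient and p-value for horsepower vs. price
pearson_coef, p_value = stats.pearsonr(df["horsepower"], df["price"])
print("Pearson coefficient:", pearson_coef, "p-value:", p_value)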

We can calculate the correlation between variables of type "int64" or "float64" using the method "corr":

In [46]:
df.corr(numeric_only=True)
Out[46]:
symboling normalized-losses wheel-base length width height curb-weight engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
symboling 1.000000 0.466264 -0.535987 -0.365404 -0.242423 -0.550160 -0.233118 -0.110581 -0.139896 -0.007992 -0.182196 0.075790 0.279719 -0.035527 0.036233 -0.082391
normalized-losses 0.466264 1.000000 -0.056661 0.019424 0.086802 -0.373737 0.099404 0.112360 -0.029800 0.055127 -0.114713 0.217300 0.239544 -0.225016 -0.181877 0.133999
wheel-base -0.535987 -0.056661 1.000000 0.876024 0.814507 0.590742 0.782097 0.572027 0.493203 0.157964 0.250313 0.371250 -0.360233 -0.470606 -0.543304 0.584642
length -0.365404 0.019424 0.876024 1.000000 0.857170 0.492063 0.880665 0.685025 0.608941 0.123913 0.159733 0.579731 -0.286035 -0.665192 -0.698142 0.690628
width -0.242423 0.086802 0.814507 0.857170 1.000000 0.306002 0.866201 0.729436 0.544879 0.188814 0.189867 0.615006 -0.245852 -0.633531 -0.680635 0.751265
height -0.550160 -0.373737 0.590742 0.492063 0.306002 1.000000 0.307581 0.074694 0.180327 -0.060822 0.259737 -0.086941 -0.309913 -0.049800 -0.104812 0.135486
curb-weight -0.233118 0.099404 0.782097 0.880665 0.866201 0.307581 1.000000 0.849072 0.644041 0.167412 0.156433 0.757994 -0.279350 -0.749543 -0.794889 0.834415
engine-size -0.110581 0.112360 0.572027 0.685025 0.729436 0.074694 0.849072 1.000000 0.572516 0.205806 0.028889 0.822649 -0.256753 -0.650546 -0.679571 0.872335
bore -0.139896 -0.029800 0.493203 0.608941 0.544879 0.180327 0.644041 0.572516 1.000000 -0.055390 0.001250 0.566838 -0.267338 -0.582121 -0.591390 0.543154
stroke -0.007992 0.055127 0.157964 0.123913 0.188814 -0.060822 0.167412 0.205806 -0.055390 1.000000 0.187854 0.097815 -0.063720 -0.034079 -0.034741 0.082267
compression-ratio -0.182196 -0.114713 0.250313 0.159733 0.189867 0.259737 0.156433 0.028889 0.001250 0.187854 1.000000 -0.214431 -0.435721 0.331425 0.268465 0.071107
horsepower 0.075790 0.217300 0.371250 0.579731 0.615006 -0.086941 0.757994 0.822649 0.566838 0.097815 -0.214431 1.000000 0.107882 -0.822138 -0.804587 0.809681
peak-rpm 0.279719 0.239544 -0.360233 -0.286035 -0.245852 -0.309913 -0.279350 -0.256753 -0.267338 -0.063720 -0.435721 0.107882 1.000000 -0.115358 -0.058605 -0.101542
city-mpg -0.035527 -0.225016 -0.470606 -0.665192 -0.633531 -0.049800 -0.749543 -0.650546 -0.582121 -0.034079 0.331425 -0.822138 -0.115358 1.000000 0.972044 -0.686571
highway-mpg 0.036233 -0.181877 -0.543304 -0.698142 -0.680635 -0.104812 -0.794889 -0.679571 -0.591390 -0.034741 0.268465 -0.804587 -0.058605 0.972044 1.000000 -0.704692
price -0.082391 0.133999 0.584642 0.690628 0.751265 0.135486 0.834415 0.872335 0.543154 0.082267 0.071107 0.809681 -0.101542 -0.686571 -0.704692 1.000000

The diagonal elements are always equal to one.

In [49]:
# Compute the correlation matrix
corr = df.corr(numeric_only=True)

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(12, 9))

# Generate a custom diverging colormap
# cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

plt.show()
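To read off the variables most strongly related to price, we can sort the "price" column of the correlation matrix by absolute value (a small sketch using the corr matrix computed above):

In [ ]:
# Rank features by absolute correlation with price
corr["price"].drop("price").abs().sort_values(ascending=False)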

Conclusion: Important Variables

We now have a better idea of what our data looks like and which variables are important to take into account when predicting the car price. We have narrowed it down to the following variables:

Continuous numerical variables:

  • Length
  • Width
  • Curb-weight
  • Engine-size
  • Horsepower
  • City-mpg
  • Highway-mpg
  • Wheel-base
  • Bore

Categorical variables:

  • Drive-wheels
  • Engine-location

As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.

A Few More Steps

Feature Selection

In [50]:
X = df[["length","width","curb-weight","engine-size","horsepower","city-mpg","highway-mpg","wheel-base","bore","drive-wheels","engine-location"]].copy()
y = df["price"].copy()
In [51]:
X.head()
Out[51]:
length width curb-weight engine-size horsepower city-mpg highway-mpg wheel-base bore drive-wheels engine-location
0 168.8 64.1 2548 130 111.0 21 27 88.6 3.47 rwd front
1 168.8 64.1 2548 130 111.0 21 27 88.6 3.47 rwd front
2 171.2 65.5 2823 152 154.0 19 26 94.5 2.68 rwd front
3 176.6 66.2 2337 109 102.0 24 30 99.8 3.19 fwd front
4 176.6 66.4 2824 136 115.0 18 22 99.4 3.19 4wd front
In [52]:
numerical_features = ["length","width","curb-weight","engine-size","horsepower","city-mpg","highway-mpg","wheel-base","bore"]
categorical_features = ["drive-wheels","engine-location"]

Feature Scaling

In [53]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X[numerical_features])

X_scaled
Out[53]:
array([[0.41343284, 0.32478632, 0.41117145, ..., 0.28947368, 0.05830904,
        0.66428571],
       [0.41343284, 0.32478632, 0.41117145, ..., 0.28947368, 0.05830904,
        0.66428571],
       [0.44925373, 0.44444444, 0.51784329, ..., 0.26315789, 0.2303207 ,
        0.1       ],
       ...,
       [0.7119403 , 0.73504274, 0.59115593, ..., 0.18421053, 0.65597668,
        0.74285714],
       [0.7119403 , 0.73504274, 0.67067494, ..., 0.28947368, 0.65597668,
        0.33571429],
       [0.7119403 , 0.73504274, 0.61055081, ..., 0.23684211, 0.65597668,
        0.88571429]])
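MinMaxScaler maps each column onto the [0, 1] range via (x - min) / (max - min), computed per column. We can spot-check the formula for one feature (a small sketch; the first column of X_scaled corresponds to "length"):

In [ ]:
# Verify the min-max formula for the "length" column
x = X["length"]
((x - x.min()) / (x.max() - x.min())).head()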

One-hot encoding of categorical attributes

In [54]:
X_encoded = pd.get_dummies(X[categorical_features])
X_encoded.head()
Out[54]:
drive-wheels_4wd drive-wheels_fwd drive-wheels_rwd engine-location_front engine-location_rear
0 0 0 1 1 0
1 0 0 1 1 0
2 0 0 1 1 0
3 0 1 0 1 0
4 1 0 0 1 0
In [55]:
X_new = np.concatenate([X_scaled,X_encoded.values],axis=1)
X_new
Out[55]:
array([[0.41343284, 0.32478632, 0.41117145, ..., 1.        , 1.        ,
        0.        ],
       [0.41343284, 0.32478632, 0.41117145, ..., 1.        , 1.        ,
        0.        ],
       [0.44925373, 0.44444444, 0.51784329, ..., 1.        , 1.        ,
        0.        ],
       ...,
       [0.7119403 , 0.73504274, 0.59115593, ..., 1.        , 1.        ,
        0.        ],
       [0.7119403 , 0.73504274, 0.67067494, ..., 1.        , 1.        ,
        0.        ],
       [0.7119403 , 0.73504274, 0.61055081, ..., 1.        , 1.        ,
        0.        ]])
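If you prefer to keep the feature names, the same concatenation can be done with pandas instead of NumPy (a small sketch, equivalent to the array above since all three objects share the same row index):

In [ ]:
# Keep column names by concatenating DataFrames instead of raw arrays
X_new_df = pd.concat(
    [pd.DataFrame(X_scaled, columns=numerical_features), X_encoded],
    axis=1)
X_new_df.head()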

Create Training and Validation Data Split

In [56]:
from sklearn.model_selection import train_test_split

X_train,X_val,y_train,y_val = train_test_split(X_new,y,test_size=0.33,random_state=42)

Model Training

In [57]:
from sklearn.linear_model import LinearRegression

reg_lr = LinearRegression().fit(X_train,y_train)

Model Evaluation

In [58]:
from sklearn.metrics import mean_absolute_error

y_pred_lr = reg_lr.predict(X_val)

mae_lr = mean_absolute_error(y_val, y_pred_lr)  # argument order: (y_true, y_pred)

print("Mean Absolute Error of Linear Regression: {}".format(mae_lr))
Mean Absolute Error of Linear Regression: 2566.692742359411
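One caveat worth flagging: above, the scaler was fit on all rows before the train/validation split, so information from the validation rows leaks into the preprocessing. A leak-free variant (a sketch using scikit-learn's Pipeline and ColumnTransformer; the exact MAE may differ slightly) splits first and fits the preprocessing on the training rows only:

In [ ]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Split the raw feature table first, so preprocessing statistics
# are learned from the training rows only
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.33, random_state=42)

preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), numerical_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

pipe = Pipeline([("prep", preprocess), ("reg", LinearRegression())])
pipe.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_val, pipe.predict(X_val)))

(train_test_split, LinearRegression and mean_absolute_error were imported in the cells above.)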