Supervised Learning

In this lab, we will build a model that predicts whether a horse with colic will survive, based on its past medical conditions. The dataset is the Horse Colic Dataset. The 'outcome' column records what happened to the horse and will serve as the label.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

Data Exploration

In [2]:
df = pd.read_csv('horse.csv')
df.head()
Out[2]:
surgery age hospital_number rectal_temp pulse respiratory_rate temp_of_extremities peripheral_pulse mucous_membrane capillary_refill_time ... packed_cell_volume total_protein abdomo_appearance abdomo_protein outcome surgical_lesion lesion_1 lesion_2 lesion_3 cp_data
0 no adult 530101 38.5 66.0 28.0 cool reduced NaN more_3_sec ... 45.0 8.4 NaN NaN died no 11300 0 0 no
1 yes adult 534817 39.2 88.0 20.0 NaN NaN pale_cyanotic less_3_sec ... 50.0 85.0 cloudy 2.0 euthanized no 2208 0 0 no
2 no adult 530334 38.3 40.0 24.0 normal normal pale_pink less_3_sec ... 33.0 6.7 NaN NaN lived no 0 0 0 yes
3 yes young 5290409 39.1 164.0 84.0 cold normal dark_cyanotic more_3_sec ... 48.0 7.2 serosanguious 5.3 died yes 2208 0 0 yes
4 no adult 530255 37.3 104.0 35.0 NaN NaN dark_cyanotic more_3_sec ... 74.0 7.4 NaN NaN died no 4300 0 0 no

5 rows × 28 columns

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 28 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   surgery                299 non-null    object 
 1   age                    299 non-null    object 
 2   hospital_number        299 non-null    int64  
 3   rectal_temp            239 non-null    float64
 4   pulse                  275 non-null    float64
 5   respiratory_rate       241 non-null    float64
 6   temp_of_extremities    243 non-null    object 
 7   peripheral_pulse       230 non-null    object 
 8   mucous_membrane        252 non-null    object 
 9   capillary_refill_time  267 non-null    object 
 10  pain                   244 non-null    object 
 11  peristalsis            255 non-null    object 
 12  abdominal_distention   243 non-null    object 
 13  nasogastric_tube       195 non-null    object 
 14  nasogastric_reflux     193 non-null    object 
 15  nasogastric_reflux_ph  53 non-null     float64
 16  rectal_exam_feces      197 non-null    object 
 17  abdomen                181 non-null    object 
 18  packed_cell_volume     270 non-null    float64
 19  total_protein          266 non-null    float64
 20  abdomo_appearance      134 non-null    object 
 21  abdomo_protein         101 non-null    float64
 22  outcome                299 non-null    object 
 23  surgical_lesion        299 non-null    object 
 24  lesion_1               299 non-null    int64  
 25  lesion_2               299 non-null    int64  
 26  lesion_3               299 non-null    int64  
 27  cp_data                299 non-null    object 
dtypes: float64(7), int64(4), object(17)
memory usage: 65.5+ KB
In [4]:
df.describe() #Only describes numeric columns, not categorical
Out[4]:
hospital_number rectal_temp pulse respiratory_rate nasogastric_reflux_ph packed_cell_volume total_protein abdomo_protein lesion_1 lesion_2 lesion_3
count 2.990000e+02 239.000000 275.000000 241.000000 53.000000 270.000000 266.000000 101.000000 299.000000 299.000000 299.000000
mean 1.087733e+06 38.168619 72.000000 30.460581 4.707547 46.307407 24.274436 3.039604 3659.709030 90.528428 7.387960
std 1.532032e+06 0.733744 28.646219 17.666102 1.982311 10.436743 27.364194 1.967947 5408.472421 650.637139 127.749768
min 5.184760e+05 35.400000 30.000000 8.000000 1.000000 23.000000 3.300000 0.100000 0.000000 0.000000 0.000000
25% 5.289040e+05 37.800000 48.000000 18.000000 3.000000 38.000000 6.500000 2.000000 2111.500000 0.000000 0.000000
50% 5.303010e+05 38.200000 64.000000 25.000000 5.000000 45.000000 7.500000 2.300000 2322.000000 0.000000 0.000000
75% 5.347360e+05 38.500000 88.000000 36.000000 6.500000 52.000000 56.750000 3.900000 3209.000000 0.000000 0.000000
max 5.305629e+06 40.800000 184.000000 96.000000 7.500000 75.000000 89.000000 10.100000 41110.000000 7111.000000 2209.000000
In [5]:
df.describe(include='object')
Out[5]:
surgery age temp_of_extremities peripheral_pulse mucous_membrane capillary_refill_time pain peristalsis abdominal_distention nasogastric_tube nasogastric_reflux rectal_exam_feces abdomen abdomo_appearance outcome surgical_lesion cp_data
count 299 299 243 230 252 267 244 255 243 195 193 197 181 134 299 299 299
unique 2 2 4 4 6 3 5 4 4 3 3 4 5 3 3 2 2
top yes adult cool normal normal_pink less_3_sec mild_pain hypomotile none slight none absent distend_large cloudy lived yes no
freq 180 275 108 114 79 187 67 127 75 101 119 79 78 47 178 190 200

Querying the dataset

Q1. Create any queries which will better help you understand the dataset.

In [6]:
#TODO
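# A few example queries (illustrative; any queries that build intuition are fine):
df['outcome'].value_counts()                         # how balanced are the classes?
df.groupby('outcome')['pulse'].mean()                # does average pulse differ by outcome?
df[df['age'] == 'young']['outcome'].value_counts()   # outcomes for the young horses only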

Data Visualization

Check out the bar plot below showing the outcome of different horses.

In [7]:
sns.countplot(x='outcome',data=df)
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c1cb00050>

Q2. Make your own visualizations and try to identify important features. Feel free to reuse examples from the previous lab, but try to get creative.

In [8]:
sns.boxplot(x='packed_cell_volume',y='outcome',data=df)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c1cb03890>
In [9]:
sns.boxplot(x='rectal_exam_feces',y='respiratory_rate',hue="outcome",data=df)
# We can see some separation between outcomes depending on the values of these features
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c544423d0>
In [10]:
sns.relplot(x="lesion_1",y="packed_cell_volume",hue='outcome',row='abdomen',data=df)
# Given this kind of distribution in the data, a tree-based model is likely to fit better than a linear model
Out[10]:
<seaborn.axisgrid.FacetGrid at 0x7f2c1a2aa790>

Fill missing values

Use the fillna() function in pandas to fill the missing values. Make sure all the features have the correct dtypes:
Numeric: float or int
Categorical: object or int

In [11]:
# Check features have correct dtype
df_dtype_nunique = pd.concat([df.dtypes, df.nunique()],axis=1)
df_dtype_nunique.columns = ["dtype","unique"]   # Make sure you understand what we are checking here
df_dtype_nunique
Out[11]:
dtype unique
surgery object 2
age object 2
hospital_number int64 283
rectal_temp float64 40
pulse float64 52
respiratory_rate float64 40
temp_of_extremities object 4
peripheral_pulse object 4
mucous_membrane object 6
capillary_refill_time object 3
pain object 5
peristalsis object 4
abdominal_distention object 4
nasogastric_tube object 3
nasogastric_reflux object 3
nasogastric_reflux_ph float64 20
rectal_exam_feces object 4
abdomen object 5
packed_cell_volume float64 50
total_protein float64 80
abdomo_appearance object 3
abdomo_protein float64 37
outcome object 3
surgical_lesion object 2
lesion_1 int64 61
lesion_2 int64 6
lesion_3 int64 2
cp_data object 2

We find that all columns have the correct dtype.

In [12]:
# Calculate number of NaNs in each column
missing_count = df.isnull().sum()
missing_count[missing_count>0].sort_index()
Out[12]:
abdomen                  118
abdominal_distention      56
abdomo_appearance        165
abdomo_protein           198
capillary_refill_time     32
mucous_membrane           47
nasogastric_reflux       106
nasogastric_reflux_ph    246
nasogastric_tube         104
packed_cell_volume        29
pain                      55
peripheral_pulse          69
peristalsis               44
pulse                     24
rectal_exam_feces        102
rectal_temp               60
respiratory_rate          58
temp_of_extremities       56
total_protein             33
dtype: int64

Q3. Fill all NaN values for numeric features with mean.

In [13]:
df.fillna(value=df.mean(numeric_only=True),inplace=True)
# df.mean(numeric_only=True) returns the mean of each float/int column, so fillna()
# here replaces NaNs only in the numeric features. (Recent pandas versions require
# numeric_only=True explicitly; older versions skipped non-numeric columns by default.)
# Had there been categorical data stored as int64, we would instead have to use
# df.fillna(value=df[numerical_features].mean(),inplace=True).
df.head()
Out[13]:
surgery age hospital_number rectal_temp pulse respiratory_rate temp_of_extremities peripheral_pulse mucous_membrane capillary_refill_time ... packed_cell_volume total_protein abdomo_appearance abdomo_protein outcome surgical_lesion lesion_1 lesion_2 lesion_3 cp_data
0 no adult 530101 38.5 66.0 28.0 cool reduced NaN more_3_sec ... 45.0 8.4 NaN 3.039604 died no 11300 0 0 no
1 yes adult 534817 39.2 88.0 20.0 NaN NaN pale_cyanotic less_3_sec ... 50.0 85.0 cloudy 2.000000 euthanized no 2208 0 0 no
2 no adult 530334 38.3 40.0 24.0 normal normal pale_pink less_3_sec ... 33.0 6.7 NaN 3.039604 lived no 0 0 0 yes
3 yes young 5290409 39.1 164.0 84.0 cold normal dark_cyanotic more_3_sec ... 48.0 7.2 serosanguious 5.300000 died yes 2208 0 0 yes
4 no adult 530255 37.3 104.0 35.0 NaN NaN dark_cyanotic more_3_sec ... 74.0 7.4 NaN 3.039604 died no 4300 0 0 no

5 rows × 28 columns

Q4. Fill in NaN values for categorical features with mode.

In [14]:
df.fillna(value=df.mode().loc[0],inplace=True)
# df.mode() returns a DataFrame, because a column can have several modes, so .loc[0]
# selects the first mode of each column. Since the numeric NaNs were already filled
# above, only the categorical features are affected here. Beware: df.mode() returns
# a mode for every column, not just the categorical ones.
df.head()
Out[14]:
surgery age hospital_number rectal_temp pulse respiratory_rate temp_of_extremities peripheral_pulse mucous_membrane capillary_refill_time ... packed_cell_volume total_protein abdomo_appearance abdomo_protein outcome surgical_lesion lesion_1 lesion_2 lesion_3 cp_data
0 no adult 530101 38.5 66.0 28.0 cool reduced normal_pink more_3_sec ... 45.0 8.4 cloudy 3.039604 died no 11300 0 0 no
1 yes adult 534817 39.2 88.0 20.0 cool normal pale_cyanotic less_3_sec ... 50.0 85.0 cloudy 2.000000 euthanized no 2208 0 0 no
2 no adult 530334 38.3 40.0 24.0 normal normal pale_pink less_3_sec ... 33.0 6.7 cloudy 3.039604 lived no 0 0 0 yes
3 yes young 5290409 39.1 164.0 84.0 cold normal dark_cyanotic more_3_sec ... 48.0 7.2 serosanguious 5.300000 died yes 2208 0 0 yes
4 no adult 530255 37.3 104.0 35.0 cool normal dark_cyanotic more_3_sec ... 74.0 7.4 cloudy 3.039604 died no 4300 0 0 no

5 rows × 28 columns

In [15]:
df.isnull().any().any()
Out[15]:
False

Transformation of Skewed Continuous Features

Sometimes a continuous feature's values cluster around one central value, but a non-trivial number of much larger or smaller values form a skewed tail that can negatively affect the learning algorithm. It is therefore common to apply a transformation, such as a log transformation, to these features. Let us take an example:

In [16]:
sns.distplot(df['total_protein'],kde = False)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c19ebfd90>

We can see that most values lie between 0 and 20; however, a large number of data points are greater than 40. We need to transform this feature.

Q5. Carry out a log transformation on total_protein by applying natural log on all values.

In [17]:
df['total_protein'] = np.log(df['total_protein'])
sns.distplot(df['total_protein'],kde = False)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c19d86410>
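
Since total_protein has a minimum of 3.3 (see describe() above), applying np.log directly is safe here. If a feature could contain zeros, np.log1p, which computes log(1 + x), is the usual alternative:

np.log1p(df['total_protein'])  # log(1 + x); defined even when x == 0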

Feature Selection

Q6. Select features from the dataset based on your analysis.

In [18]:
# abdomo_appearance, abdomo_protein and nasogastric_reflux_ph are probably not good
# features because they have a lot of missing values.
# lesion_2 and lesion_3 show almost no variation in their values (mostly 0).
# You can choose to keep all features, but it is not recommended: irrelevant or
# partially relevant features can negatively impact model performance.
numerical_features = ['packed_cell_volume','respiratory_rate','total_protein','rectal_temp','pulse','lesion_1']
categorical_features = ['abdomen','rectal_exam_feces','temp_of_extremities']
X = df[numerical_features+categorical_features].copy()  # .copy() avoids SettingWithCopyWarning when we modify X below
y = df["outcome"]

Ordinal and One-hot encoding of categorical attributes

Learning algorithms expect numeric values. However, categorical attributes can provide a lot of information to the model, so we incorporate them by encoding them.

Q7. Encode the categorical variables (Use pd.get_dummies() for one-hot encode).

In [19]:
#TODO

#Ordinal (can also use OrdinalEncoder())
temp_code = {'cold':0,'cool':1,'normal':2,'warm':3}
X['temp_of_extremities'] = X['temp_of_extremities'].map(temp_code)

#One-hot
X = pd.get_dummies(data=X,columns=['abdomen','rectal_exam_feces'])

X.head()
Out[19]:
packed_cell_volume respiratory_rate total_protein rectal_temp pulse lesion_1 temp_of_extremities abdomen_distend_large abdomen_distend_small abdomen_firm abdomen_normal abdomen_other rectal_exam_feces_absent rectal_exam_feces_decreased rectal_exam_feces_increased rectal_exam_feces_normal
0 45.0 28.0 2.128232 38.5 66.0 11300 1 1 0 0 0 0 0 1 0 0
1 50.0 20.0 4.442651 39.2 88.0 2208 1 0 0 0 0 1 1 0 0 0
2 33.0 24.0 1.902108 38.3 40.0 0 2 0 0 0 1 0 0 0 0 1
3 48.0 84.0 1.974081 39.1 164.0 2208 0 1 0 0 0 0 0 1 0 0
4 74.0 35.0 2.001480 37.3 104.0 4300 1 1 0 0 0 0 1 0 0 0
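
Note: pd.get_dummies() also accepts drop_first=True, which drops one dummy column per encoded feature to avoid perfectly collinear columns. That matters mostly for linear models; the tree-based models we use below are largely indifferent to it.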

Encoding the labels

One more step before we move on: convert the categorical labels in 'outcome' to numbers. This is called label encoding and is done with LabelEncoder. Check the example in the sklearn documentation if you get stuck.

Q8. Using LabelEncoder, encode 'outcome'

In [20]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()  #Instantiate the encoder
y = le.fit_transform(y)  #Fit and transform the labels using labelencoder

y
Out[20]:
array([0, 1, 2, 0, 0, 2, 2, 0, 1, 2, 2, 2, 2, 0, 2, 0, 0, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 0, 2, 0, 1, 2, 2, 2, 1, 1, 2, 2, 0, 0, 2, 2, 1,
       1, 0, 2, 2, 0, 0, 2, 0, 2, 2, 1, 0, 2, 0, 0, 0, 2, 2, 0, 1, 2, 2,
       2, 2, 2, 2, 1, 2, 0, 0, 0, 1, 2, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 1,
       2, 2, 2, 0, 2, 0, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 0, 2, 2, 2, 1, 0,
       0, 2, 2, 0, 2, 2, 2, 1, 2, 2, 2, 1, 2, 0, 1, 2, 2, 2, 1, 2, 0, 0,
       2, 0, 0, 0, 2, 2, 2, 2, 0, 0, 2, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2,
       2, 2, 1, 2, 2, 2, 0, 2, 1, 2, 1, 2, 2, 2, 1, 2, 0, 2, 0, 0, 2, 2,
       0, 2, 1, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 1, 0, 2,
       2, 2, 0, 1, 2, 2, 1, 2, 0, 1, 2, 0, 0, 2, 2, 2, 2, 2, 2, 1, 2, 1,
       2, 2, 2, 2, 2, 0, 2, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 1, 1, 0, 2, 0,
       2, 2, 0, 2, 0, 2, 2, 1, 0, 0, 2, 2, 0, 1, 0, 2, 0, 2, 2, 2, 0, 2,
       1, 2, 2, 2, 2, 0, 2, 2, 0, 1, 0, 1, 2, 2, 2, 2, 0, 0, 2, 2, 2, 1,
       2, 2, 2, 2, 0, 0, 1, 2, 1, 1, 0, 2, 1])
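
Note that LabelEncoder assigns integer codes in sorted order of the class names, so here died → 0, euthanized → 1 and lived → 2 (le.classes_ lists the classes in encoded order).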

Creating a Train and Test Split

Q9. Create training and validation split on data. Check out train_test_split() function from sklearn to do this.

In [21]:
from sklearn.model_selection import train_test_split

X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.33,random_state=42)  # Check out what random_state does (it makes the split reproducible)
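
With imbalanced labels like ours ('lived' accounts for 178 of 299 rows), it can also help to pass stratify=y so that both splits keep similar class proportions (stratify is a standard train_test_split parameter):

X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.33,random_state=42,stratify=y)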

Scaling of numeric attributes

Q10. Scale the numeric attributes using MinMaxScaler, StandardScaler (Z-score normalization) or RobustScaler. Fit the scaler on the training data only, then transform the train and validation sets separately!

In [22]:
#TODO
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_val[numerical_features] = scaler.transform(X_val[numerical_features])  

# It is important to fit the scaler on the training data only: the validation set is
# supposed to be unseen data on which we test our models. Fitting on both would leak
# validation-set statistics (mean, median, IQR, etc.) into the preprocessing.
# RobustScaler centers on the median and scales by the IQR, which makes it less
# sensitive to outliers like the ones we saw in pulse and lesion_1.

X_train[numerical_features].head()
Out[22]:
packed_cell_volume respiratory_rate total_protein rectal_temp pulse lesion_1
101 0.118855 -0.625000 0.526478 0.000000 -0.25 0.269833
69 -0.272727 -1.125000 -0.109705 -0.240885 -0.55 0.223079
31 0.636364 -1.125000 0.014612 -1.383742 0.95 -0.226561
172 2.181818 0.028786 0.037076 0.000000 0.85 -0.226561
148 -1.000000 -0.375000 0.976286 0.759115 -0.55 -1.324297

Model Selection and Training

Q11. Select 2 classifiers, instantiate them and train them. A few models are given below:

  • DecisionTreeClassifier
  • GaussianNB
  • RandomForestClassifier
  • Support Vector Machine
  • Any other classifier, look them up!
In [23]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Initialize and train
clf1 = DecisionTreeClassifier().fit(X_train,y_train)
clf2 = RandomForestClassifier().fit(X_train,y_train)

Cross Validation and Performance Analysis

In [24]:
from sklearn.metrics import accuracy_score  # Find out what accuracy_score measures

y_pred_1 = clf1.predict(X_val)
y_pred_2 = clf2.predict(X_val)

acc1 = accuracy_score(y_val,y_pred_1)*100  # accuracy_score expects (y_true, y_pred); accuracy is symmetric, but this is the documented order
acc2 = accuracy_score(y_val,y_pred_2)*100

print("Accuracy score of clf1: {}".format(acc1))
print("Accuracy score of clf2: {}".format(acc2))
Accuracy score of clf1: 59.59595959595959
Accuracy score of clf2: 75.75757575757575
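
Accuracy alone can be misleading when the classes are imbalanced ('lived' dominates), so it is worth inspecting per-class behaviour too. A minimal sketch using sklearn's confusion_matrix and classification_report:

from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_val, y_pred_2))                                  # rows = true class, columns = predicted class
print(classification_report(y_val, y_pred_2, target_names=le.classes_))  # per-class precision, recall and F1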

Hyperparameter Tuning

How do we optimize the classifier to produce the best results? We need to tune the model by varying its hyperparameters. We can use GridSearchCV to simplify the whole process.

For GridSearchCV, carry out the following steps (we will only do this for one classifier, so choose one of your previous classifiers):

  • Initialize a new classifier object
  • Create a dictionary of parameters you wish to tune (e.g. parameters = {'param_name': [list of values]})
  • Note: Avoid tuning the max_features parameter of your learner if that parameter is available!
  • Use make_scorer to create an accuracy_score object
  • Perform grid search on the classifier clf using the 'scorer', and store it in grid_obj.
  • Fit the grid search object to the training data (X_train, y_train), and store it in grid_fit.
In [25]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

#TODO
clf = RandomForestClassifier()        #Initialize the classifier object

parameters = {'n_estimators':[10,50,100]}    #Dictionary of parameters

scorer = make_scorer(accuracy_score)         #Initialize the scorer using make_scorer

grid_obj = GridSearchCV(clf,parameters,scoring=scorer)         #Initialize a GridSearchCV object with above parameters,scorer and classifier

grid_fit = grid_obj.fit(X_train,y_train)        #Fit the gridsearch object with X_train,y_train

best_clf = grid_fit.best_estimator_         #Get the best estimator. For this, check documentation of GridSearchCV object

unoptimized_predictions = (clf.fit(X_train, y_train)).predict(X_val)      #Using the unoptimized classifier, generate predictions
optimized_predictions = best_clf.predict(X_val)        #Same, but use the best estimator

acc_unop = accuracy_score(y_val, unoptimized_predictions)*100       #Calculate accuracy for unoptimized model
acc_op = accuracy_score(y_val, optimized_predictions)*100         #Calculate accuracy for optimized model

print("Accuracy score on unoptimized model:{}".format(acc_unop))
print("Accuracy score on optimized model:{}".format(acc_op))
Accuracy score on unoptimized model:71.71717171717171
Accuracy score on optimized model:71.71717171717171

We have learned some methods for boosting accuracy. Try experimenting with the functions above to improve model performance.
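
For example, if the tuned model ties the untuned one (as above), the grid was probably too narrow. A sketch of a wider grid (the parameter names are real RandomForestClassifier arguments; the value ranges are just starting points):

parameters = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 2, 5],
}
grid_obj = GridSearchCV(RandomForestClassifier(random_state=42), parameters, scoring=scorer, cv=5)
best_clf = grid_obj.fit(X_train, y_train).best_estimator_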

In [ ]: