In this lab, we will build a model that predicts whether a horse with colic will survive, based on its past medical conditions. The dataset is the Horse Colic Dataset. The column 'outcome' records what happened to the horse and will serve as the label.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('horse.csv')
df.head()
df.info()
df.describe() #Only describes numeric columns, not categorical
df.describe(include='object')
Q1. Write any queries that help you better understand the dataset. One possible set of answers is sketched below.
#TODO
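One possible set of answers (a sketch; any queries that build intuition about the data are fine, and the column names used here all appear later in this lab):
print(df['outcome'].value_counts()) # How balanced are the classes?
print(df.groupby('outcome')['pulse'].mean()) # Does average pulse differ by outcome?
print(df.groupby('outcome')['packed_cell_volume'].mean()) # Does packed cell volume differ by outcome?
print(df.select_dtypes(include=np.number).corr()['pulse'].sort_values()) # How do the numeric features correlate with pulse?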
Check out the bar plot below showing the outcome of different horses.
sns.countplot(x='outcome',data=df)
Q2. Make your own visualizations and try to figure out important features. Feel free to use the examples from previous lab, but try to get creative.
sns.boxplot(x='packed_cell_volume',y='outcome',data=df)
sns.boxplot(x='rectal_exam_feces',y='respiratory_rate',hue="outcome",data=df)
# We can see some separation between outcomes depending on the values of these features
sns.relplot(x="lesion_1",y="packed_cell_volume",hue='outcome',row='abdomen',data=df)
# Seeing this kind of distribution in the data suggests that a tree-based model may work better than a linear model
Use the fillna() function in pandas to fill missing values. Make sure all the features have the correct dtypes.
Numeric: float or int
Categorical: object or int
# Check features have correct dtype
df_dtype_nunique = pd.concat([df.dtypes, df.nunique()],axis=1)
df_dtype_nunique.columns = ["dtype","unique"] # Make sure you understand what we are checking here
df_dtype_nunique
We find that all columns have the correct dtype.
# Calculate number of NaNs in each column
missing_count = df.isnull().sum()
missing_count[missing_count>0].sort_index()
Q3. Fill all NaN values for numeric features with mean.
df.fillna(value=df.mean(numeric_only=True),inplace=True) # numeric_only=True restricts the mean to numeric columns. Had there been categorical data stored as int64, we would instead do df.fillna(value=df[numerical_features].mean(),inplace=True)
# Only NaNs in float and int dtype columns are replaced with the mean. [See the output of df.mean(numeric_only=True) to understand this]
df.head()
Q4. Fill in NaN values for categorical features with mode.
df.fillna(value=df.mode().loc[0],inplace=True) # df.mode() returns a DataFrame (a column can have more than one mode), so .loc[0] picks the first mode of each column
# Since the NaNs in the numerical features have already been filled, only the NaNs in the categorical columns are replaced by the mode here. Beware: df.mode() returns a mode for every column.
df.head()
df.isnull().any().any() # Should now be False -- no NaNs remain
Sometimes continuous features are distributed such that most values lie near one central value, but there is also a non-trivial number of much larger or smaller values, which can hurt the learning algorithm. It is therefore common to apply a transformation such as a log transformation to these features. Let us take an example:
sns.distplot(df['total_protein'],kde = False)
We can see that most of the values lie between 0 and 20; however, there is a sizeable number of data points greater than 40. We need to transform this feature.
Q5. Carry out a log transformation on total_protein by applying natural log on all values.
df['total_protein'] = np.log(df['total_protein'])
sns.distplot(df['total_protein'],kde = False)
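Note: np.log returns -inf for zeros, so it is only safe on strictly positive columns. If a feature you want to transform can contain zeros, np.log1p (which computes log(1 + x)) is a common alternative; a small standalone sketch:
vals = pd.Series([0.0, 5.0, 40.0, 80.0]) # toy values, just for illustration
print(np.log1p(vals)) # finite everywhere, including at 0
print(np.log(vals)) # -inf at 0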
Q6. Select features from the dataset based on your analysis.
# abdomo_appearance, abdomo_protein, nasogastric_reflux_ph are probably not good features because they have a lot of missing values
# lesion_2 and lesion_3 don't have much variation in their values (mostly 0; see the quick check after this cell)
# You can choose to keep all features but it's not recommended because irrelevant or partially relevant features can negatively impact model performance.
numerical_features = ['packed_cell_volume','respiratory_rate','total_protein','rectal_temp','pulse','lesion_1']
categorical_features = ['abdomen','rectal_exam_feces','temp_of_extremities']
X = df[numerical_features+categorical_features].copy() # .copy() so the encoding steps below can modify X without a SettingWithCopyWarning
y = df["outcome"]
Learning algorithms expect numeric values. However, categorical attributes can provide a lot of information to the model, so we incorporate them by encoding them as numbers.
Q7. Encode the categorical variables (use pd.get_dummies() for one-hot encoding).
#TODO
#Ordinal (can also use OrdinalEncoder())
temp_code = {'cold':0,'cool':1,'normal':2,'warm':3}
X['temp_of_extremities'] = X['temp_of_extremities'].map(temp_code)
#One-hot
X = pd.get_dummies(data=X,columns=['abdomen','rectal_exam_feces'])
X.head()
One more step before we move on is converting the categorical labels in 'outcome' to numbers. This is called label encoding, and is done with the help of LabelEncoder. Check the sklearn documentation example for help.
Q8. Using LabelEncoder, encode 'outcome'
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder() #Instantiate the encoder
y = le.fit_transform(y) #Fit and transform the labels using labelencoder
y
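To see which integer was assigned to which class (useful when interpreting predictions later), the fitted encoder exposes classes_; a short sketch:
print(le.classes_) # original labels, listed in the order of their integer codes 0, 1, 2, ...
print(le.inverse_transform(y[:5])) # decode the first few encoded labels back to the original strings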
Q9. Create training and validation split on data. Check out train_test_split() function from sklearn to do this.
from sklearn.model_selection import train_test_split
X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.33,random_state=42) # Check out what random_state does
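The outcome classes are not evenly balanced (see the countplot earlier), so it can help to keep the class proportions similar in both splits. An optional variant (a sketch) using the stratify argument:
# Optional: a stratified split keeps the proportion of each outcome class roughly the same in train and validation
X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.33,random_state=42,stratify=y)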
Q10. Scale numeric attributes using MinMaxScaler, StandardScaler (Z-score normalization) or RobustScaler. Fit the scaler on the training set only, then transform the train and validation sets separately!
#TODO
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_val[numerical_features] = scaler.transform(X_val[numerical_features])
# It is important to fit the scaler on the training data only and then use it to transform the validation data, because the validation set is supposed to be unseen data on which we test our models. If we fit on both together, information from the validation set (its mean, median, IQR, etc.) leaks into the scaling.
X_train[numerical_features].head()
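As an aside (a sketch, not required for this lab): wrapping the scaler and the model in a single sklearn Pipeline guarantees the scaler is re-fit on training data only, even inside cross-validation or grid search. It would be fit on the unscaled features, so treat it as an alternative to the manual scaling above rather than an extra step:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
# Scale only the numeric columns; the already-encoded categorical columns pass through unchanged
preprocess = ColumnTransformer([('num', RobustScaler(), numerical_features)], remainder='passthrough')
pipe = Pipeline([('prep', preprocess), ('model', RandomForestClassifier(random_state=42))])
# pipe.fit(...) would be called on an *unscaled* train split, e.g. a fresh train_test_split of X, y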
Q11. Select two classifiers, instantiate them, and train them. Two example models are given below:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Initialize and train
clf1 = DecisionTreeClassifier().fit(X_train,y_train)
clf2 = RandomForestClassifier().fit(X_train,y_train)
from sklearn.metrics import accuracy_score # Find out what accuracy_score is
y_pred_1 = clf1.predict(X_val)
y_pred_2 = clf2.predict(X_val)
acc1 = accuracy_score(y_val,y_pred_1)*100
acc2 = accuracy_score(y_val,y_pred_2)*100
print("Accuracy score of clf1: {}".format(acc1))
print("Accuracy score of clf2: {}".format(acc2))
How do we optimize the classifier in order to produce the best results? We need to tune the model by varying various hyperparameters. We can use GridSearchCV to simplify the whole process.
For GridSearchCV, carry out the following steps (we will only do this for one classifier, so choose one of your previous classifiers):
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
#TODO
clf = RandomForestClassifier() #Initialize the classifier object
parameters = {'n_estimators':[10,50,100]} #Dictionary of parameters
scorer = make_scorer(accuracy_score) #Initialize the scorer using make_scorer
grid_obj = GridSearchCV(clf,parameters,scoring=scorer) #Initialize a GridSearchCV object with above parameters,scorer and classifier
grid_fit = grid_obj.fit(X_train,y_train) #Fit the gridsearch object with X_train,y_train
best_clf = grid_fit.best_estimator_ #Get the best estimator. For this, check documentation of GridSearchCV object
unoptimized_predictions = (clf.fit(X_train, y_train)).predict(X_val) # Using the unoptimized classifier, generate predictions
optimized_predictions = best_clf.predict(X_val) #Same, but use the best estimator
acc_unop = accuracy_score(y_val, unoptimized_predictions)*100 #Calculate accuracy for unoptimized model
acc_op = accuracy_score(y_val, optimized_predictions)*100 #Calculate accuracy for optimized model
print("Accuracy score on unoptimized model:{}".format(acc_unop))
print("Accuracy score on optimized model:{}".format(acc_op))
We have learnt some methods to boost our accuracy. Try experimenting with the functions above to improve model performance further; one possible starting grid is sketched below.
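For example (a sketch; the values below are just starting points, not tuned choices), you could widen the random-forest grid and let GridSearchCV explore it:
# A wider, hypothetical grid to experiment with -- expect this to take noticeably longer to run
parameters = {'n_estimators':[50,100,200], 'max_depth':[None,5,10], 'min_samples_split':[2,5,10]}
grid_obj = GridSearchCV(RandomForestClassifier(random_state=42), parameters, scoring=scorer, cv=5)
best_clf = grid_obj.fit(X_train, y_train).best_estimator_
print(accuracy_score(y_val, best_clf.predict(X_val))*100)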