In this lab, we will build a model that predicts whether a horse with colic will survive, based on its past medical conditions. The dataset is the Horse Colic Dataset. The column 'outcome' records what happened to the horse and will serve as the label.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('horse.csv')
df.head()
df.info()
df.describe() #Only describes numeric columns, not categorical
df.describe(include='object')
Q1. Write any queries that help you better understand the dataset. One possible set of answers is sketched below.
#TODO
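One possible set of answers (a sketch; any queries that build intuition about the data are fine, and the column names used here all appear later in this lab):
print(df['outcome'].value_counts()) # How balanced are the classes?
print(df.groupby('outcome')['pulse'].mean()) # Does average pulse differ by outcome?
print(df.groupby('outcome')['packed_cell_volume'].mean()) # Does packed cell volume differ by outcome?
print(df.select_dtypes(include=np.number).corr()['pulse'].sort_values()) # How do the numeric features correlate with pulse?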
Check out the bar plot below showing the outcome of different horses.
sns.countplot(x='outcome',data=df)
Q2. Make your own visualizations and try to figure out important features. Feel free to use the examples from previous lab, but try to get creative.
sns.boxplot(x='packed_cell_volume',y='outcome',data=df)
sns.boxplot(x='rectal_exam_feces',y='respiratory_rate',hue="outcome",data=df)
# We can see some separation between outcomes depending on the values of these features
sns.relplot(x="lesion_1",y="packed_cell_volume",hue='outcome',row='abdomen',data=df)
# Seeing this kind of distribution in the data suggests that a tree-based model may work better than a linear model
Use the fillna() function in pandas to fill missing values. Make sure all the features have the correct dtypes.
Numeric: float or int
Categorical: object or int
# Check features have correct dtype
df_dtype_nunique = pd.concat([df.dtypes, df.nunique()],axis=1)
df_dtype_nunique.columns = ["dtype","unique"] # Make sure you understand what we are checking here
df_dtype_nunique
We find that all columns have the correct dtype.
# Calculate number of NaNs in each column
missing_count = df.isnull().sum()
missing_count[missing_count>0].sort_index()
Q3. Fill all NaN values for numeric features with mean.
df.fillna(value=df.mean(numeric_only=True),inplace=True) # numeric_only=True restricts the mean to numeric columns. Had there been categorical data stored as int64, we would instead do df.fillna(value=df[numerical_features].mean(),inplace=True)
# Only NaNs in float and int dtype columns are replaced with the mean. [See the output of df.mean(numeric_only=True) to understand this]
df.head()
Q4. Fill in NaN values for categorical features with mode.
df.fillna(value=df.mode().loc[0],inplace=True) # df.mode() returns a DataFrame (a column can have more than one mode), so .loc[0] picks the first mode of each column
# Since the NaNs in the numerical features have already been filled, only the NaNs in the categorical columns are replaced by the mode here. Beware: df.mode() returns a mode for every column.
df.head()
df.isnull().any().any() # Should now be False -- no NaNs remain
Sometimes continuous features are distributed such that most values lie near one central value, but there is also a non-trivial number of much larger or smaller values, which can hurt the learning algorithm. It is therefore common to apply a transformation such as a log transformation to these features. Let us take an example:
sns.distplot(df['total_protein'],kde = False)
We can see that most of the values lie between 0 and 20; however, there is a sizeable number of data points greater than 40. We need to transform this feature.
Q5. Carry out a log transformation on total_protein by applying natural log on all values.
df['total_protein'] = np.log(df['total_protein'])
sns.distplot(df['total_protein'],kde = False)
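Note: np.log returns -inf for zeros, so it is only safe on strictly positive columns. If a feature you want to transform can contain zeros, np.log1p (which computes log(1 + x)) is a common alternative; a small standalone sketch:
vals = pd.Series([0.0, 5.0, 40.0, 80.0]) # toy values, just for illustration
print(np.log1p(vals)) # finite everywhere, including at 0
print(np.log(vals)) # -inf at 0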
Q6. Select features from the dataset based on your analysis.
# abdomo_appearance, abdomo_protein, nasogastric_reflux_ph are probably not good features because they have a lot of missing values
# lesion_2 and lesion_3 don't have much variation in their values (mostly 0; see the quick check after this cell)
# You can choose to keep all features but it's not recommended because irrelevant or partially relevant features can negatively impact model performance.
numerical_features = ['packed_cell_volume','respiratory_rate','total_protein','rectal_temp','pulse','lesion_1']
categorical_features = ['abdomen','rectal_exam_feces','temp_of_extremities']
X = df[numerical_features+categorical_features].copy() # .copy() so the encoding steps below can modify X without a SettingWithCopyWarning
y = df["outcome"]
Learning algorithms expect numeric values. However, categorical attributes can provide a lot of information to the model, so we incorporate them by encoding them as numbers.
Q7. Encode the categorical variables (use pd.get_dummies() for one-hot encoding).
#TODO
#Ordinal (can also use OrdinalEncoder())
temp_code = {'cold':0,'cool':1,'normal':2,'warm':3}
X['temp_of_extremities'] = X['temp_of_extremities'].map(temp_code)
#One-hot
X = pd.get_dummies(data=X,columns=['abdomen','rectal_exam_feces'])
X.head()
One more step before we move on is converting the categorical labels in 'outcome' to numbers. This is called label encoding, and is done with the help of LabelEncoder. Check the sklearn documentation example for help.
Q8. Using LabelEncoder, encode 'outcome'
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder() #Instantiate the encoder
y = le.fit_transform(y) #Fit and transform the labels using labelencoder
y
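To see which integer was assigned to which class (useful when interpreting predictions later), the fitted encoder exposes classes_; a short sketch:
print(le.classes_) # original labels, listed in the order of their integer codes 0, 1, 2, ...
print(le.inverse_transform(y[:5])) # decode the first few encoded labels back to the original strings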
Q9. Create training and validation split on data. Check out train_test_split() function from sklearn to do this.
from sklearn.model_selection import train_test_split
X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.33,random_state=42) # Check out what random_state does
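The outcome classes are not evenly balanced (see the countplot earlier), so it can help to keep the class proportions similar in both splits. An optional variant (a sketch) using the stratify argument:
# Optional: a stratified split keeps the proportion of each outcome class roughly the same in train and validation
X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.33,random_state=42,stratify=y)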
Q10. Scale numeric attributes using MinMaxScaler, StandardScaler (Z-score normalization) or RobustScaler. Fit the scaler on the training set only, then transform the train and validation sets separately!
#TODO
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_val[numerical_features] = scaler.transform(X_val[numerical_features])
# It is important to fit the scaler on the training data only and then use it to transform the validation data, because the validation set is supposed to be unseen data on which we test our models. If we fit on both together, information from the validation set (its mean, median, IQR, etc.) leaks into the scaling.
X_train[numerical_features].head()
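As an aside (a sketch, not required for this lab): wrapping the scaler and the model in a single sklearn Pipeline guarantees the scaler is re-fit on training data only, even inside cross-validation or grid search. It would be fit on the unscaled features, so treat it as an alternative to the manual scaling above rather than an extra step:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
# Scale only the numeric columns; the already-encoded categorical columns pass through unchanged
preprocess = ColumnTransformer([('num', RobustScaler(), numerical_features)], remainder='passthrough')
pipe = Pipeline([('prep', preprocess), ('model', RandomForestClassifier(random_state=42))])
# pipe.fit(...) would be called on an *unscaled* train split, e.g. a fresh train_test_split of X, y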
Q11. Select two classifiers, instantiate them, and train them. Two example models are given below:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Initialize and train
clf1 = DecisionTreeClassifier().fit(X_train,y_train)
clf2 = RandomForestClassifier().fit(X_train,y_train)
from sklearn.metrics import accuracy_score # Find out what accuracy_score is
y_pred_1 = clf1.predict(X_val)
y_pred_2 = clf2.predict(X_val)
acc1 = accuracy_score(y_val,y_pred_1)*100
acc2 = accuracy_score(y_val,y_pred_2)*100
print("Accuracy score of clf1: {}".format(acc1))
print("Accuracy score of clf2: {}".format(acc2))
How do we optimize the classifier in order to produce the best results? We need to tune the model by varying various hyperparameters. We can use GridSearchCV to simplify the whole process.
For GridSearchCV, carry out the following steps (we will only do this for one classifier, so choose one of your previous classifiers):
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
#TODO
clf = RandomForestClassifier() #Initialize the classifier object
parameters = {'n_estimators':[10,50,100]} #Dictionary of parameters
scorer = make_scorer(accuracy_score) #Initialize the scorer using make_scorer
grid_obj = GridSearchCV(clf,parameters,scoring=scorer) #Initialize a GridSearchCV object with above parameters,scorer and classifier
grid_fit = grid_obj.fit(X_train,y_train) #Fit the gridsearch object with X_train,y_train
best_clf = grid_fit.best_estimator_ #Get the best estimator. For this, check documentation of GridSearchCV object
unoptimized_predictions = (clf.fit(X_train, y_train)).predict(X_val) # Using the unoptimized classifier, generate predictions
optimized_predictions = best_clf.predict(X_val) #Same, but use the best estimator
acc_unop = accuracy_score(y_val, unoptimized_predictions)*100 #Calculate accuracy for unoptimized model
acc_op = accuracy_score(y_val, optimized_predictions)*100 #Calculate accuracy for optimized model
print("Accuracy score on unoptimized model:{}".format(acc_unop))
print("Accuracy score on optimized model:{}".format(acc_op))
We have learnt some methods to boost our accuracy. Try experimenting with the functions above to improve model performance further; one possible starting grid is sketched below.
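For example (a sketch; the values below are just starting points, not tuned choices), you could widen the random-forest grid and let GridSearchCV explore it:
# A wider, hypothetical grid to experiment with -- expect this to take noticeably longer to run
parameters = {'n_estimators':[50,100,200], 'max_depth':[None,5,10], 'min_samples_split':[2,5,10]}
grid_obj = GridSearchCV(RandomForestClassifier(random_state=42), parameters, scoring=scorer, cv=5)
best_clf = grid_obj.fit(X_train, y_train).best_estimator_
print(accuracy_score(y_val, best_clf.predict(X_val))*100)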