GridSearch + Pipelines of Multiple Models on Multiclass Classification

Ryan Reilly
6 min read · Jun 21, 2021

In my previous post, I talked about using pipelines from Scikit-learn to automate workflows and make the modeling process more streamlined with cleaner code. I also introduced the concept of using GridSearch in Scikit-learn. In this tutorial, I am going to show you how to use GridSearch in combination with pipelines on a multiclass classification dataset. This will allow you to not only streamline your model workflows but also optimize the hyperparameters within each model using GridSearch.

Multiclass Classification Dataset

I will be using a dataset of phone features to predict a phone’s price range. There are 2000 rows in this dataset. Each row represents the features of a phone. The target column in the dataset is called price_range and has values from 0 to 3:

0 (low cost)
1 (medium cost)
2 (high cost)
3 (very high cost)

There are a total of 20 features including the following:

battery_power: Total energy a battery can store at one time, measured in mAh
blue: Has Bluetooth or not
clock_speed: Speed at which the microprocessor executes instructions
dual_sim: Has dual SIM support or not
fc: Front camera megapixels
four_g: Has 4G or not
touch_screen: Has a touchscreen or not

Here is the output of the df.info() method so you can see all the columns:

 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   battery_power  2000 non-null   int64
 1   blue           2000 non-null   int64
 2   clock_speed    2000 non-null   float64
 3   dual_sim       2000 non-null   int64
 4   fc             2000 non-null   int64
 5   four_g         2000 non-null   int64
 6   int_memory     2000 non-null   int64
 7   m_dep          2000 non-null   float64
 8   mobile_wt      2000 non-null   int64
 9   n_cores        2000 non-null   int64
 10  pc             2000 non-null   int64
 11  px_height      2000 non-null   int64
 12  px_width       2000 non-null   int64
 13  ram            2000 non-null   int64
 14  sc_h           2000 non-null   int64
 15  sc_w           2000 non-null   int64
 16  talk_time      2000 non-null   int64
 17  three_g        2000 non-null   int64
 18  touch_screen   2000 non-null   int64
 19  wifi           2000 non-null   int64
 20  price_range    2000 non-null   int64
dtypes: float64(2), int64(19)
memory usage: 328.2 KB
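
Since this is a multiclass problem, it is also worth a quick check that the four classes are reasonably balanced before modeling. A minimal sketch (assuming the data has already been read into df, as shown in the next section):

# Count how many rows fall into each price_range class
print(df['price_range'].value_counts().sort_index())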

Libraries

Below are the libraries I used to perform the analysis with GridSearch and pipelines. I have also provided links to the documentation for each item I loaded.

# Read in the data
import pandas as pd
# Scale the data
from sklearn.preprocessing import StandardScaler
# Pipeline, Gridsearch, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
# Plot the confusion matrix at the end of the tutorial
from sklearn.metrics import plot_confusion_matrix
# Classification Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn import svm

First I am going to read in the data and perform a train test split. For the purposes of this tutorial, assume the data has already been cleaned.

# Read in the data
df = pd.read_csv('cell_phones.csv')

# Set variables for the target and features
y = df['price_range']
X = df.drop('price_range', axis=1)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
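
One optional tweak I did not use above: with a multiclass target you can pass stratify=y so that all four price ranges keep their proportions in both splits. A variant, if you want it:

# Stratified variant: preserves the class balance of price_range
# in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y)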

Now I am going to construct a pipeline for each type of model I want to train on the dataset. Below I am creating six different pipelines. Each pipeline defines a workflow of two steps: the first scales the data, and the second instantiates the model to be fit. I chose these six because they were the models we learned about in Phase 3 of our Bootcamp.

pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('LR', LogisticRegression(random_state=42))])
pipe_dt = Pipeline([('scl', StandardScaler()),
                    ('DT', DecisionTreeClassifier(random_state=42))])
pipe_rf = Pipeline([('scl', StandardScaler()),
                    ('RF', RandomForestClassifier(random_state=42))])
pipe_knn = Pipeline([('scl', StandardScaler()),
                     ('KNN', KNeighborsClassifier())])
pipe_svm = Pipeline([('scl', StandardScaler()),
                     ('SVM', svm.SVC(random_state=42))])
pipe_xgb = Pipeline([('scl', StandardScaler()),
                     ('XGB', XGBClassifier(random_state=42))])
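
Note that each pipeline already behaves like a single estimator, so you could fit and score one on its own before any tuning. A quick sketch:

# A pipeline is itself an estimator: .fit scales then trains, and
# .score applies the training-set scaling before predicting
pipe_lr.fit(X_train, y_train)
print(pipe_lr.score(X_test, y_test))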

Great, so now we want to create the grid search parameters for each model. We are doing this so that we can pass them to the GridSearchCV function, which will take a pipeline and test out every combination of parameter values in the corresponding parameter grid. Before creating the grids, I defined the parameter values in lists so that I could pass in the lists rather than hardcoded values.

param_range = [1, 2, 3, 4, 5, 6]
param_range_fl = [1.0, 0.5, 0.1]
n_estimators = [50, 100, 150]
learning_rates = [.1, .2, .3]

lr_param_grid = [{'LR__penalty': ['l1', 'l2'],
                  'LR__C': param_range_fl,
                  'LR__solver': ['liblinear']}]
dt_param_grid = [{'DT__criterion': ['gini', 'entropy'],
                  'DT__min_samples_leaf': param_range,
                  'DT__max_depth': param_range,
                  'DT__min_samples_split': param_range[1:]}]
rf_param_grid = [{'RF__min_samples_leaf': param_range,
                  'RF__max_depth': param_range,
                  'RF__min_samples_split': param_range[1:]}]
knn_param_grid = [{'KNN__n_neighbors': param_range,
                   'KNN__weights': ['uniform', 'distance'],
                   'KNN__metric': ['euclidean', 'manhattan']}]
svm_param_grid = [{'SVM__kernel': ['linear', 'rbf'],
                   'SVM__C': param_range}]
xgb_param_grid = [{'XGB__learning_rate': learning_rates,
                   'XGB__max_depth': param_range,
                   'XGB__min_child_weight': param_range[:2],
                   'XGB__subsample': param_range_fl,
                   'XGB__n_estimators': n_estimators}]
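
A note on the naming convention in these grids: each key is the pipeline step name, a double underscore, then a parameter of the estimator in that step. That is how GridSearchCV routes each value into the right step. If you are unsure what keys a pipeline accepts, you can list them all:

# Every tunable parameter of a pipeline, in <step>__<param> form,
# e.g. 'LR__C', 'LR__penalty', 'scl__with_mean'
print(sorted(pipe_lr.get_params().keys()))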

We are almost there. Now we can use the GridSearchCV function and pass in both the pipelines we created and the grid parameters we created for each model. We also pass in cv=3 so that the grid search performs 3-fold cross-validation on our training set, and scoring='accuracy' in order to get the accuracy score when we score on our test data.

lr_grid_search = GridSearchCV(estimator=pipe_lr,
                              param_grid=lr_param_grid,
                              scoring='accuracy',
                              cv=3)
dt_grid_search = GridSearchCV(estimator=pipe_dt,
                              param_grid=dt_param_grid,
                              scoring='accuracy',
                              cv=3)
rf_grid_search = GridSearchCV(estimator=pipe_rf,
                              param_grid=rf_param_grid,
                              scoring='accuracy',
                              cv=3)
knn_grid_search = GridSearchCV(estimator=pipe_knn,
                               param_grid=knn_param_grid,
                               scoring='accuracy',
                               cv=3)
svm_grid_search = GridSearchCV(estimator=pipe_svm,
                               param_grid=svm_param_grid,
                               scoring='accuracy',
                               cv=3)
xgb_grid_search = GridSearchCV(estimator=pipe_xgb,
                               param_grid=xgb_param_grid,
                               scoring='accuracy',
                               cv=3)
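
Since these six calls differ only in the pipeline and the grid, you could also build them in a loop. An equivalent sketch, if you prefer less repetition:

# Equivalent construction of the same six grid searches in one pass
pairs = [(pipe_lr, lr_param_grid), (pipe_dt, dt_param_grid),
         (pipe_rf, rf_param_grid), (pipe_knn, knn_param_grid),
         (pipe_svm, svm_param_grid), (pipe_xgb, xgb_param_grid)]
grid_searches = [GridSearchCV(estimator=pipe, param_grid=grid,
                              scoring='accuracy', cv=3)
                 for pipe, grid in pairs]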

So far we have defined a lot of pipelines and grid searches, but we haven’t fit anything yet. Here is where the cool part comes in. We first define a list of all the grid searches we just created, called grids, then we create a for loop to fit all of them.

grids = [lr_grid_search, dt_grid_search, rf_grid_search,
         knn_grid_search, svm_grid_search, xgb_grid_search]

for pipe in grids:
    pipe.fit(X_train, y_train)

The above code took about three and a half minutes to run. Your run time will vary depending on how many models you are fitting.
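
If the run time bothers you, one easy speed-up I did not use here is GridSearchCV’s n_jobs parameter, which parallelizes the cross-validated fits across CPU cores:

# n_jobs=-1 runs the candidate fits on all available CPU cores
svm_grid_search = GridSearchCV(estimator=pipe_svm,
                               param_grid=svm_param_grid,
                               scoring='accuracy',
                               cv=3,
                               n_jobs=-1)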

Lastly, we can see how each model performed by calling .score on each fitted grid search and passing in the test data. The code below first creates a dictionary of classifier names to be used in the for loop. The for loop then prints out the accuracy of each model and the specific parameters that worked best from the parameter grids we created above.

grid_dict = {0: 'Logistic Regression', 1: 'Decision Trees',
             2: 'Random Forest', 3: 'K-Nearest Neighbors',
             4: 'Support Vector Machines', 5: 'XGBoost'}

for i, model in enumerate(grids):
    print('{} Test Accuracy: {}'.format(grid_dict[i],
                                        model.score(X_test, y_test)))
    print('{} Best Params: {}'.format(grid_dict[i], model.best_params_))
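
If you want to pick the winner programmatically instead of reading the printout, a small follow-up sketch:

# Keep the grid search whose pipeline scores highest on the test set
best_idx = max(range(len(grids)),
               key=lambda i: grids[i].score(X_test, y_test))
print('Best model: {}'.format(grid_dict[best_idx]))
best_model = grids[best_idx].best_estimator_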

Below is my output when I ran it. It looks like Support Vector Machines was the most accurate model, coming in at 0.97, which is great! You can also see the best parameters to pass to each model, given the lists we provided.

Logistic Regression Test Accuracy: 0.866
Logistic Regression Best Params: {'LR__C': 1.0, 'LR__penalty': 'l1', 'LR__solver': 'liblinear'}
Decision Trees Test Accuracy: 0.858
Decision Trees Best Params: {'DT__criterion': 'entropy', 'DT__max_depth': 6, 'DT__min_samples_leaf': 1, 'DT__min_samples_split': 4}
Random Forest Test Accuracy: 0.86
Random Forest Best Params: {'RF__max_depth': 6, 'RF__min_samples_leaf': 2, 'RF__min_samples_split': 5}
K-Nearest Neighbors Test Accuracy: 0.586
K-Nearest Neighbors Best Params: {'KNN__metric': 'manhattan', 'KNN__n_neighbors': 6, 'KNN__weights': 'distance'}
Support Vector Machines Test Accuracy: 0.97
Support Vector Machines Best Params: {'SVM__C': 5, 'SVM__kernel': 'linear'}
XGBoost Test Accuracy: 0.9
XGBoost Best Params: {'XGB__learning_rate': 0.1, 'XGB__max_depth': 5, 'XGB__min_child_weight': 1, 'XGB__n_estimators': 150, 'XGB__subsample': 0.5}

Below is the confusion matrix. You can see the SVM model did a really good job at predicting a cell phone’s price range: the correct predictions fall on the diagonal.

plot_confusion_matrix(svm_grid_search, X_test, y_test)
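
One version note: plot_confusion_matrix was deprecated in scikit-learn 1.0 and removed in 1.2, so on newer versions the equivalent call is:

# Equivalent plot on scikit-learn >= 1.0
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(svm_grid_search, X_test, y_test)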

Conclusion

By now you should be able to see how you can test out many models using pipelines, along with testing many hyperparameter values within each model using GridSearch.

Keep in mind there can be some drawbacks to performing grid search with pipelines:

  1. The first drawback can be the size of your data. The full dataset I used was small (only 2,000 rows), so each model did not take that long to scale and train. For larger datasets, some of these models would not make sense or would take an extremely long time to run.
  2. The second drawback is related to how many parameters you pass in your parameter grid. The more hyperparameter values you want to try, the longer GridSearchCV will take to fit. For instance, in the XGBoost pipeline, because I used learning_rates (3 values), param_range (6 values), param_range[:2] (2 values), param_range_fl (3 values), and n_estimators (3 values), the total number of combinations tested was 3 × 6 × 2 × 3 × 3 = 324, and with cv=3 that means 972 separate model fits! One common remedy is sketched after this list.
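
For that second drawback, the usual remedy is RandomizedSearchCV, which samples a fixed number of combinations from the grid instead of trying all of them. A minimal sketch for the XGBoost pipeline (n_iter=30 is an arbitrary budget, not a tuned choice):

# Samples 30 of the 324 combinations, trading exhaustiveness
# for a predictable run time
from sklearn.model_selection import RandomizedSearchCV

xgb_random_search = RandomizedSearchCV(estimator=pipe_xgb,
                                       param_distributions=xgb_param_grid[0],
                                       n_iter=30,
                                       scoring='accuracy',
                                       cv=3,
                                       random_state=42)
xgb_random_search.fit(X_train, y_train)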
