Automate your Workflows using Pipelines in Scikit-learn

Ryan Reilly
5 min read · May 31, 2021

One of the key skills we have learned at Flatiron during our data engineering phase is the ability to automate data processing through the use of loops and functions. Loops allow us to repeat workflows while functions allow us to build reusable code. This process of automation can be extended to thinking about the workflows involved in creating machine learning models. The Scikit-learn library provides a Pipeline module to do just that. It allows you to chain transformers and estimators together into a sequence that functions as one object in your code.

So what are transformers and estimators anyway?

A transformer is a way to “transform” your dataset into a state suitable for a machine learning model. A transformer first learns any statistics it needs with fit(), then applies the transformation with transform(). You can use transformers in your pipeline to clean your data, impute missing values, create or reduce the number of features, scale your data, and more.
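To make the fit/transform idea concrete, here is a minimal sketch using StandardScaler, one of Scikit-learn's built-in transformers: fit() learns the mean and standard deviation of each column, and transform() rescales the data with them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

scaler = StandardScaler()
scaler.fit(X)                    # learn the mean and std from the data
X_scaled = scaler.transform(X)   # center to mean 0, scale to unit variance
print(X_scaled.ravel())          # roughly [-1.22, 0.0, 1.22]
```

Every transformer in Scikit-learn follows this same fit/transform contract, which is what lets a pipeline chain them together.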

An estimator refers to a machine learning model. Once you transform your data, you call fit() to train the model of your choice and predict() to generate predictions, depending on what you are trying to accomplish. As the chart below shows, choosing the right model can be a daunting task, as there are many estimators to choose from.

Algorithm cheat sheet from Scikit-learn
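Whichever model the cheat sheet points you to, the estimator interface looks the same. Here is a minimal sketch using logistic regression on the built-in iris dataset (any other classifier could be swapped in with the same two calls):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)               # estimate the model's parameters from the data
preds = clf.predict(X[:5])  # predict class labels for some observations
```

Because every estimator exposes fit() and predict(), a pipeline can treat the final model as just another step.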

Once you choose your model, you can set up your code so that all of your transformations and your estimator happen in one block of code, like the pipeline box below. The diagram shows how both the training and testing data flow through the workflow (pipeline) structure.

Copyright: Sebastian Raschka

In the context of pipelines, both your transformers and your estimator are referred to as estimators. They are passed in as a list of (name, estimator) pairs, where the name is a string identifying the step and the value is an estimator object. All estimators in a pipeline, except the last one, must be transformers. Below is a basic construction of pipeline objects in Scikit-learn.

Construction of a pipeline from Scikit-learn

Here is an example of comparing multiple models using Pipelines:

Code was taken from the YouTube walkthrough at the bottom of this post under Resources

There are benefits beyond having your transformers and estimators in a nice sequential order. A common time sink in fine-tuning machine learning models is updating the hyperparameters of the model. Using pipelines, along with the grid search method, allows you to tune the hyperparameters of the entire pipeline, including both transformers and estimators.

What are hyperparameters and GridSearch?

Model parameters are internal to a model and can be estimated from the data. Model hyperparameters are external to the model and must be set manually. Finding the right hyperparameter values means trying different candidate values for your model and testing which ones work best. Performing a grid search is the process of exhaustively evaluating every combination of hyperparameter values from a predefined grid. Scikit-learn's GridSearchCV tool lets you apply this search to an entire pipeline at once.
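A minimal sketch of grid search over a pipeline: hyperparameters are addressed as "step name, double underscore, parameter name", so a single search can tune a transformer (here PCA's number of components) and the estimator (logistic regression's regularization strength) together.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# "<step>__<param>" syntax reaches into individual pipeline steps
param_grid = {
    "pca__n_components": [2, 3],   # transformer hyperparameter
    "clf__C": [0.1, 1.0, 10.0],    # estimator hyperparameter
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)  # tries every combination with 5-fold cross-validation
print(search.best_params_)
```

Because the search refits the whole pipeline on each fold, the transformers are re-fit on each training split, avoiding data leakage during tuning.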

Alternative Pipeline tools

Spark ML: MLlib fits into Spark’s APIs and interoperates with NumPy in Python. A description from the Spark website: “The spark.ml package aims to provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines. See the algorithm guides section below for guides on sub-packages of spark.ml, including feature transformers unique to the Pipelines API, ensembles, and more.”

Weka: Weka is another machine learning software that has its own tools for managing machine learning pipelines. Weka’s functionality can be accessed from Python using the Python Weka Wrapper.

TPOT: TPOT was developed in the Computational Genetics Lab at the University of Pennsylvania. This tool is actually built on top of Scikit-learn and optimizes machine learning pipelines using genetic programming. A little more on genetic programming here.

MLBlocks: MLBlocks is a simple framework for composing end-to-end tunable machine learning pipelines by seamlessly combining tools from any Python library with a simple, common, and uniform interface.

These are just a few; there are many alternatives to Scikit-learn, each with its own machine learning pipeline tooling under the hood. Here is a top-20 alternatives list I found from G2.

History

The Scikit-learn library was started in 2007 as a Google Summer of Code project. Between 2007 and 2016, 13 students were sponsored by Google to work on Scikit-learn. It currently sits as an open-source, community-driven project with a core list of contributors for development and maintenance. It is funded by institutional and private grants from large companies (Microsoft included). The pipeline module within the Scikit-learn GitHub repository has 60 contributors.

Resources

Below are some resources for using pipelines in Scikit-learn, along with some simple tutorials:

Pipeline Function: This Scikit-learn page provides a simple example of using the pipeline function along with the various methods used with the function. There is also a list of links at the bottom that lead to more examples of where this is used.

Pipeline User Guide: This is Scikit-learn’s more comprehensive look at pipelines and how they can be used in conjunction with other tools during the transformation and training of your models.

Basic Tutorial: This post gives a really good overview of a basic pipeline implementation. The writer even shows how you can easily stack multiple pipelines to find the best model, along with a tuning pipeline that helps you find the best hyperparameters for a given model.

Basic Tutorial: Here is another great tutorial that goes over a simple example of using a couple of transformations along with a classifier (estimator) to indicate if an email is considered spam or not. It also shows how you can create your own function to be used as a transformation step in a pipeline.

Advanced Tutorial: This post on Kaggle is a little more advanced, but it illustrates the other features that can be used with the pipeline module to make your machine learning models train and predict more efficiently.

YouTube walkthrough: I found this video very helpful. Pipelines can sound abstract when you first hear about them, but the presenter goes over how to use them in this short video and does a great job of clearly explaining how the code works.
