Process of Creating an Effective Machine Learning Pipeline

Posted by Prudhvi Vajja on August 28, 2020

To Understand the ML pipeline in an Interactive way check out the Explorer App.

😱: There was a time where I thought ML pipelines are something similar to OOPS in Coding languages which we have to learn to code ml models in a particular way, But that’s not True at all Rather Pipelines are just a thought process on how to think and organize your ML Models. for better debugging and understanding why the model actually works In this Blog let me explain you the Nuts and Bolts of ML Pipeline so that you don’t have to worry about it anymore.

Although ML pipelines are easy to understand it takes weeks or months for a company to come up with a ML Pipeline that meets the requirements Let us see why?

Before you start diving in, building models and predicting the future ask yourself these questions.

Or step-1 of a ML pipeline

ML Problem Framing

What is the Business Problem for this particular scenario ?

What are you trying to solve/Predict?

Do you really need a Machine Learning Approach to solve this Problem?

Here are some tips to know if you really need if you need a machine learning approach.

  • Are there any repeating patterns that you want to understand (Fraud/UnFraudlent Transactions)
  • Do you have required data to understand the patterns. (Good data : Rows = 10X the number of features/Columns you have)
  • Understand the type of prediction that you want to make whether it is a Binary/Multi Classification problem or you need to predict a Continues value(Regression) such as stock price.

Wait……………………………..! you are not ready yet.

Now it’s time to ask Domain expects and test your assumptions (which can again be yourself if you are doing an own project)

The more questions you ask at this stage the better your model is going to be

  • what are the important features that effect your predictions.
  • Are they any feature overlaps?
  • How to test your model (This is not as easy as splitting data)?
  • Questions that can help you understand the domain and problem that you are solving.

Let’s get on with the boring part and dive in into actually predicting the Future.

Data Collection & Integration

No matter from where you or your team collects the data there might be some noise in it. So you should be equipped with tools which can help you clean and integrate the data properly and you should be able to handle all kinds of data types thrown at you.

Although it is not interesting to do but it is useful to produce interesting results. Becoz real data doesn’t consists of numerical and categorical features it consists of garbage and others Indolence.

Here is an elaborated article by Robert R.F. DeFilippi and Real Python on how to get started do check it out.

Get the Status of Data
  • Load data pd.read_{filetype}('filename')
  • Get Shape of data data.shape
  • Get the statistical information of the data data.describe()
  • Know the Data types and info data.info(), data.dtypes
  • Know the count of missing values data.isnull().sum()

Whooo!, My data looks stunning and I can’t wait to run my model and get predictions.

Data Preparation

This is where the true skills of a data scientist comes into picture.

Take a random sample of your training data (Any size that doesn’t take too much time to run) and you really need to dig deep and should be able to answer all your questions such as.

  • What features are there?
  • Does it match your Expectations?
  • Is their enough information to make accurate predictions?
  • Should you remove some of the features?
  • Are there any features that are not represented properly.
  • Are the labels correctly classified in the Training data?
  • What is happening for incorrect data?
  • Are there any missing values or outliers Etc.

If you do your homework properly you will able to build your model without any worries of wrong prediction.

How can we answer these questions programmatically? That takes us to our next step 👇

Data Visualization & Analysis

I love visualization, when you draw your data on a chart it reveals unseen pattrens, outliers and things the we are not able to see until now. Basic plots such as histograms and scatter plots can tell you a lot about your data

Some of the things that you need to know are:

  • Trying to Understand Label/Target summaries.
  • Histograms to detect scale and distribution: Data.plot.hist()
  • Scatter plot to understand the Variance/Correlation and Trend: Data.plot.scatter()
  • Box Plot to detect outliers: Data.boxplot()
  • Get correlation coefficients by Data.corr()(Default is Pearson) and plot the Heat map of these values for better understding the relation between features and target columns by sns.heatmap(Data.corr())
  • Plot to categorical plot onto a FacetGrid: sns.catplot()

You don’t want hurt your model right. If you data contains missing values or outliers are incorrect data types your model will suffer. In order to run your model smootly you need to take the burden off cleaning it.

Things to consider:

  • Imputing missing values with mean, mode Or using non missing values to predict missing values or Dig Deep , Link , Link
  • Taking Care of Outliers by removing or Dig Deep Link , Link
  • Finding Correlated features Etc..

Now That part the you have been waiting so long for Selecting the model that you want to Run.

It is not easy to choose one model as their are wide range of models and all the models have their basic assumptions. choosing the model that aligns with your assumptions of the Dataset is also and Artwork done by Data Scientists.

Models are mainly classified into.

Supervised:
  • Linear Regression, Logistic Regression, SVM, Decision Trees, Random Forests, XGBoost, LightGBM, KNN
Unsupervised:
  • Clustering, K-Means, Hierarchical clustering, Fuzzy C-Means, PCA, SVD, ICA Etc..
Reinforcement Learning:
  • Model Free RL-(TRPO, PPO, A3C, PG), Q-learning -(DQN, QR-DQN, C51, HER), Model Based RL
Deep Learning:
  • NN, CNN, RNN, LSTM, R-CNN, GANS, Etc..

What’s Next Can I run my Model Now………. Yes If you don’t want better results (The Risk that should not be willing to take no matter the cost). These next part is arguably the critical and time consuming part of the machine learning pipeline. It only comes with experience but I will try to explain as much as I can to help you get started. Get Ready .!

Feature Selection & Engineering

The main ideas for Feature engineering comes from asking questions and what you have learned from visualization process above.

Questions worth asking are:

  • Do these features make sense?
  • How can I engineer new features based on visualizations.
  • Can I create new features by combining old features.

Tips for Feature Engineering:

  • Combining similar columns to a single column.
  • Converting categorical into numerical variables such as count of the letters in a word or one hot encoding, LabelEncoder , Etc.
  • Converting date-time into days or months.
  • Scaling the Numerical features so that model find local minima easily.
  • Binning column data into groups.(It also useful for handling outliers)

Dig Deep - ML Mastery , Medium , Kaggle , Elite , Elite2

WOOOOH……Atlast! Finally ready for training huh huh Sort of:

Model Training

Steps for Model Training:

  • Split Data into Train, Dev, Test:
    • Make sure to Randomize your Data to overcome Bias when you run your Model.
    • Create a test set that closely represents the train set (Stratified Test set, Same Date ranges Etc..)
    • Using Cross Validation.
  • Bias and Variance:
    • Under-fitting = Low variance and High Bias = Simple Model
    • Over Fitting = Low Bias and High variance = Complex Model
    • Hyper-parmeter Tuning helps you balance the model bias and variance.

You Did IT……..

It’s time to collect your salary. RUN the model.

Libraries : Mxnet, Scikit-Learn, TensorFlow, PyTorch, Cloud AutoML, Pycaret, Theano, Keras, SciPy, NLTK, MLlib, XGBoost, LightGBM, Many More

Sorry! It’s not done Yet. There are some details to evaluate and interpret which makes you stand out from the crowd such as:

Model Evaluation

There are different metrics to evaluate your model, sometimes you may have to come up with your own metric to evaluate the model performance. Anyways here are some of common metrics that helps you understand what is going wrong in the model or how accurate is the model for real world or to understand why is the model even working in the first place. Choosing the right metric

Don’t just run single model run multiple Regression/Classification models and choose the best one based on the metrics here.

Regression Metrics:
Classification Metrics:
  • Confusion Matrix, Precision, Recall, F1-score, ROC, AUC Dig Deep

The End:

Face it you have framed the problem, Cleaned the data, Did Data visualization, Selected important feature and did feature engineering, You trained your model again and again by tuning the hyper parameters and testing on Dev set what’s left.

Prediction

Model.Predict(Test_Data) And Deploy (This is an Entirely Seprate Topic )

There may be a question still bugging in your mind.

Everything is Fine, But all of this is Just Theory how can I actually implement it.?

Ans: As I said earlier Pipelines are just a thought Process to solve a problem in systematic manner so that it will be easy to debug in the future: There are tools like Amazon SageMaker which takes care of all these steps at high scale and provides with you extra features-and tools to organize, monitor, track your models

TIPS

  • When you want to test multiple models it is also advisable to test the models on small datasets rather than running on entire datasets.
  • Check the important features that helped and model and continue to feature engineering again on this features.
  • Drop the features which are not correlated with the target variable in any way.
  • Ensemble different models for better prediction and randomization using Maximum Voting or Percentages.
  • Train the best model on both train and dev set best testing on test set.
  • Modeling is not just a one time step you need the right executable code to Re-run your model to be updated with the current needs.

Check out this Blog (Yet to Come) were this steps were used to solve a real world example.

Let me known in the comments if you have learned something new today.