How to save Machine Learning models in Python using Pickle and Joblib
When building a Machine Learning model, it is usually worth saving the trained model to a file so that it does not have to be retrained every time it is needed. The saved model can then be loaded and reused in any program.
Solving a machine learning problem consists of two basic steps: training the model and making predictions with the trained model.
This article covers a step-by-step approach on how to save a Machine Learning model in Python using Pickle and Joblib.
Content Outline:
1. Getting the data
2. Data preprocessing
3. Visualizing the data on a pair plot
4. Training the model
5. Using Pickle and Joblib to save the trained model
6. Making predictions with the saved model
Getting the data
In Machine Learning, train and test sets are usually large, with many rows and columns, which helps the trained model predict more accurately.
In this article, we shall be working on a small multivariate dataset obtained by scraping a property-sales website in Nigeria, saved as ‘lekki_house_pricing.csv’.
Here’s the link to the dataset.
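As a rough sketch of the loading step, the data can be read into a pandas DataFrame with `read_csv`. The real file is not reproduced here, so the example reads a tiny made-up stand-in from memory; the column names (`bedrooms`, `bathrooms`, `toilets`, `parking_space`, `price`) and every value are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'lekki_house_pricing.csv'. The column names and
# values are invented for illustration; they are not the scraped data.
CSV_TEXT = """bedrooms,bathrooms,toilets,parking_space,price
4,4,5,4,55000000
4,5,5,4,60000000
4,4,5,5,63000000
4,5,5,6,76000000
4,3,5,2,34000000
4,6,5,5,73000000
"""

# With the real file on disk this would be:
#   df = pd.read_csv("lekki_house_pricing.csv")
df = pd.read_csv(io.StringIO(CSV_TEXT))

print(df.shape)
print(df.head())
```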
Data preprocessing
We start by computing summary statistics for the DataFrame columns.
Here, the standard deviation of the bedrooms and toilets features is zero, which means their values never vary. A constant column tells the model nothing, so we remove both from the data.
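A minimal sketch of this step, assuming the data is already in a DataFrame: `DataFrame.describe()` returns the count, mean, standard deviation, minimum, quartiles and maximum of each numeric column. The DataFrame below is a made-up stand-in with hypothetical column names, not the scraped dataset.

```python
import pandas as pd

# Made-up stand-in data; column names and values are assumptions.
df = pd.DataFrame({
    "bedrooms":      [4, 4, 4, 4, 4, 4],
    "bathrooms":     [4, 5, 4, 5, 3, 6],
    "toilets":       [5, 5, 5, 5, 5, 5],
    "parking_space": [4, 4, 5, 6, 2, 5],
    "price":         [55_000_000, 60_000_000, 63_000_000,
                      76_000_000, 34_000_000, 73_000_000],
})

# One row per statistic, one column per numeric feature
summary = df.describe()
print(summary)

# The "std" row is the one to check for columns that never vary
print(summary.loc["std"])
```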
This leaves us with 3 columns: bathrooms, parking space, and price.
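The removal described above might look like the sketch below. It finds the constant columns by their zero standard deviation rather than naming them by hand, which is one possible approach and not necessarily the one used originally. The data and column names are again made-up assumptions.

```python
import pandas as pd

# Made-up stand-in data; column names and values are assumptions.
df = pd.DataFrame({
    "bedrooms":      [4, 4, 4, 4, 4, 4],
    "bathrooms":     [4, 5, 4, 5, 3, 6],
    "toilets":       [5, 5, 5, 5, 5, 5],
    "parking_space": [4, 4, 5, 6, 2, 5],
    "price":         [55_000_000, 60_000_000, 63_000_000,
                      76_000_000, 34_000_000, 73_000_000],
})

# Collect every column whose standard deviation is exactly zero
constant_cols = [col for col in df.columns if df[col].std() == 0]

# Drop them; the remaining columns are the ones that actually vary
df = df.drop(columns=constant_cols)

print("dropped:", constant_cols)
print("kept:", df.columns.tolist())
```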
Visualizing the data on a pair plot
To get an idea of the distribution of data points, we make a visualization using the seaborn pairplot.
Training the model
To train the model, we first separate the data set into features (X) and label (y).
We shall be training our model to predict home prices using the Linear Regression technique.
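The split and the training step can be sketched as below with scikit-learn's `LinearRegression`. The stand-in data is invented and follows an exact linear rule, so the fitted coefficients are easy to check by eye; the real data will not be this tidy. Column names are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up stand-in data built from an exact linear rule:
#   price = 5,000,000 * bathrooms + 8,000,000 * parking_space + 3,000,000
df = pd.DataFrame({
    "bathrooms":     [4, 5, 4, 5, 3, 6],
    "parking_space": [4, 4, 5, 6, 2, 5],
    "price":         [55_000_000, 60_000_000, 63_000_000,
                      76_000_000, 34_000_000, 73_000_000],
})

# Features (X) are every column except the label; the label (y) is the price
X = df[["bathrooms", "parking_space"]]
y = df["price"]

# Fit an ordinary least-squares linear regression
model = LinearRegression()
model.fit(X, y)

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```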
Next, we use the trained model to predict the prices of two houses with different numbers of bathrooms and parking spaces.
Our model predicts a price of about 56 million naira for a house with 5 bathrooms and 4 parking spaces, and about 63.5 million naira for one with 4 bathrooms and 5 parking spaces.
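A sketch of the prediction call, assuming a model fitted on the two remaining feature columns. Because it is trained on made-up stand-in data, the numbers it prints will differ from the figures quoted above. Passing the new houses as a DataFrame with the same column names as the training data keeps the feature order explicit.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up stand-in training data; names and values are assumptions.
df = pd.DataFrame({
    "bathrooms":     [4, 5, 4, 5, 3, 6],
    "parking_space": [4, 4, 5, 6, 2, 5],
    "price":         [55_000_000, 60_000_000, 63_000_000,
                      76_000_000, 34_000_000, 73_000_000],
})
features = ["bathrooms", "parking_space"]
model = LinearRegression().fit(df[features], df["price"])

# Two houses to price, using the same column names as the training data
new_houses = pd.DataFrame({"bathrooms": [5, 4], "parking_space": [4, 5]})
predicted = model.predict(new_houses)

for (_, house), price in zip(new_houses.iterrows(), predicted):
    print(f"{house['bathrooms']} bathrooms, "
          f"{house['parking_space']} parking spaces: {price:,.0f} naira")
```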
Using Pickle and Joblib modules to save the trained model
Using Pickle
Pickle is a Python module for serializing and de-serializing Python object structures. Converting a Python object such as a list or dictionary into a byte stream so that it can be stored in a file or database is called ‘pickling’ or ‘serialization’; the reverse is ‘unpickling’ or ‘de-serialization’.
Running the code snippet above saves the model as a binary file in the working directory. We can see this in the image below.
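Saving a fitted model with Pickle can be sketched as follows: open a file in write-binary mode and pass the model to `pickle.dump`. The file name `pickle_model.pkl` and the training data are assumptions made for this example.

```python
import pickle
from pathlib import Path

from sklearn.linear_model import LinearRegression

# Made-up stand-in data: [bathrooms, parking_space] -> price
X = [[4, 4], [5, 4], [4, 5], [5, 6], [3, 2], [6, 5]]
y = [55_000_000, 60_000_000, 63_000_000,
     76_000_000, 34_000_000, 73_000_000]
model = LinearRegression().fit(X, y)

# Hypothetical file name, written to the current working directory
model_path = Path("pickle_model.pkl")

# Pickle writes bytes, so the file must be opened in binary mode ("wb")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

print(f"saved {model_path} ({model_path.stat().st_size} bytes)")
```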
Using Joblib
Joblib does essentially the same job as the Pickle module. It is a standalone library that scikit-learn depends on; older scikit-learn releases also exposed it as sklearn.externals.joblib, which has since been removed.
The code snippet above saves the model into a file called ‘joblib_model’. Just as with Pickle, the file appears in the working directory, as can be seen in the image below.
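A comparable sketch with Joblib: `joblib.dump` takes the object and a file name, and opens and closes the file itself. The file name `joblib_model` follows the article; the training data is a made-up stand-in.

```python
from pathlib import Path

import joblib
from sklearn.linear_model import LinearRegression

# Made-up stand-in data: [bathrooms, parking_space] -> price
X = [[4, 4], [5, 4], [4, 5], [5, 6], [3, 2], [6, 5]]
y = [55_000_000, 60_000_000, 63_000_000,
     76_000_000, 34_000_000, 73_000_000]
model = LinearRegression().fit(X, y)

# Written to the current working directory
model_path = Path("joblib_model")

# No explicit open() is needed; joblib handles the file for us
joblib.dump(model, model_path)

print(f"saved {model_path} ({model_path.stat().st_size} bytes)")
```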
Making predictions with the saved model
Using the Pickle model
To use the saved model, we open the file in read-binary mode and load it into a model object, which we call ‘pickle_file’ here. This model object is then used to make predictions.
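The round trip can be sketched as below. The example first saves a model so that it is self-contained, then reloads it with `pickle.load` and predicts. The loaded object is named `pickle_file` to match the text; the file name and data are assumptions.

```python
import pickle
from pathlib import Path

from sklearn.linear_model import LinearRegression

# Made-up stand-in data: [bathrooms, parking_space] -> price
X = [[4, 4], [5, 4], [4, 5], [5, 6], [3, 2], [6, 5]]
y = [55_000_000, 60_000_000, 63_000_000,
     76_000_000, 34_000_000, 73_000_000]
model = LinearRegression().fit(X, y)

# Save first so there is a file to load (hypothetical file name)
model_path = Path("pickle_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load: the file must be opened in read-binary mode ("rb")
with open(model_path, "rb") as f:
    pickle_file = pickle.load(f)

# The loaded object behaves like the original fitted model
new_houses = [[5, 4], [4, 5]]
loaded_predictions = pickle_file.predict(new_houses)
original_predictions = model.predict(new_houses)
print(loaded_predictions)
```

Only unpickle files that come from a source you trust, since loading a pickle can execute arbitrary code.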
Using the Joblib model
As with Pickle, we load the file into a model object, which we call ‘Joblib_file’ here, and use it for predictions.
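The Joblib equivalent, again as a self-contained sketch that saves before loading. `joblib.load` takes the file name directly. The variable name `Joblib_file` follows the text; the data is a made-up stand-in.

```python
from pathlib import Path

import joblib
from sklearn.linear_model import LinearRegression

# Made-up stand-in data: [bathrooms, parking_space] -> price
X = [[4, 4], [5, 4], [4, 5], [5, 6], [3, 2], [6, 5]]
y = [55_000_000, 60_000_000, 63_000_000,
     76_000_000, 34_000_000, 73_000_000]
model = LinearRegression().fit(X, y)

# Save first so there is a file to load
model_path = Path("joblib_model")
joblib.dump(model, model_path)

# Load the model back; the name Joblib_file mirrors the article's wording
Joblib_file = joblib.load(model_path)

# The loaded object predicts exactly like the original fitted model
new_houses = [[5, 4], [4, 5]]
loaded_predictions = Joblib_file.predict(new_houses)
original_predictions = model.predict(new_houses)
print(loaded_predictions)
```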
Conclusion
Joblib is generally regarded as more efficient than Pickle on objects that carry large NumPy arrays internally, which is often the case for fitted scikit-learn models. I have not profiled either myself, so I advise you to test both and pick whichever suits the needs of your project, as neither is the best choice in every situation.
If this was helpful, feel free to connect with me on Twitter and LinkedIn.
Happy Data Science-ing!