Creating a high-performance model is not easy. Efforts are usually focused on selecting the Machine Learning algorithm that best explains the data, or on tweaking its parameters to get the best results. But I think the secret to improving a Machine Learning model lies in the earlier steps, those where the data is engineered before the learning phase.
We can roughly split a typical Machine Learning project into these 8 phases:
- Determine if the question we want to answer is a Machine Learning problem
- Gather the data
- Understand the data
- Clean the data
- Wrangle the data
- Feature Engineering
- Train the model
- Evaluate the quality
In my experience, steps 1 to 6 are the most time-consuming: they can take up 80% or 90% of the whole project’s effort. There are many reasons why these tasks are so laborious. I will only mention some of them here:
- Data in the real world sucks. It’s dirty and unstructured, with missing values and errors… and you have to clean it, structure it and make decisions about those errors and missing values.
- Data is spread across many data sources. You need to reach all of them.
- Each data source has dozens of tables and, as a CleverData colleague says, Machine Learning datasets live in a bi-dimensional world: you have to condense all the data into a single csv file, just one table with rows and columns.
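As an aside, here is a minimal sketch of that flattening in Python with pandas. The file, table and column names (clients.csv, contacts.csv, client_id…) are invented for illustration; the point is simply that one-to-many data has to be aggregated down to one row per client:

```python
import pandas as pd

# Hypothetical source tables (names and columns invented for this example)
clients = pd.read_csv("clients.csv")      # one row per client
contacts = pd.read_csv("contacts.csv")    # one row per contact attempt

# Collapse the one-to-many contacts table to one row per client...
agg = (contacts.groupby("client_id")
               .agg(n_contacts=("contact_id", "count"),
                    last_contact=("date", "max"))
               .reset_index())

# ...and join everything into the single flat table Machine Learning expects
flat = clients.merge(agg, on="client_id", how="left")
flat.to_csv("ml_dataset.csv", index=False)
```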
In this post, I will concentrate on step 4, the cleaning process, and explain how to use Machine Learning to clean your data. For this purpose, I will use a bank marketing dataset. First, I will create a model with “raw” data and evaluate its performance. I will then clean the data and check how the performance of the model changes. I will use the BigML platform throughout this step-by-step tutorial. Ok… ready?
1. The data
Our dataset has information about historical bank marketing campaigns and the outcome for each client (whether or not they subscribed to a deposit). Our goal is to use this data to create a Machine Learning model that predicts whether a new client will subscribe to a deposit that the bank will offer in new marketing campaigns.
Here is what the dataset looks like (if you want to see it, you can download it by clicking here, but you don’t need to in order to follow this tutorial):
Each of the 4,521 rows in the file has data about a different client and their behavior:
- 4 columns with personal client data (age, job, marital status, education).
- 4 columns with historical business data (has defaulted, credit balance, housing loan, personal loan).
- 7 columns with contact info (type of contact, last contact day and month, contact duration, number of contacts, days passed, previous contacts).
- 1 column with the outcome of the previous campaign.
- 1 column that tells us whether or not the client subscribed to a deposit in the last campaign. This is the label, that is, what we want to predict for new instances.
The first thing to do is upload the dataset into your BigML account. Use the “Create a source from a URL” option along with this URL:
http://cleverdata.io/wp-content/uploads/2016/12/Bank-Marketing-Dataset.csv
You will see the first 25 instances in the UI. All the column types have been automatically detected, so you don’t need to change anything. After the csv file is uploaded, click 1-CLICK DATASET to create the dataset:
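If you prefer scripting to clicking, the same two steps can be done with BigML’s Python bindings (the bigml package). A minimal sketch, assuming your credentials are in the BIGML_USERNAME and BIGML_API_KEY environment variables:

```python
from bigml.api import BigML

api = BigML()  # picks up BIGML_USERNAME / BIGML_API_KEY from the environment

# "Create a source from a URL", then the 1-CLICK DATASET equivalent
source = api.create_source(
    "http://cleverdata.io/wp-content/uploads/2016/12/Bank-Marketing-Dataset.csv")
api.ok(source)                  # wait until the source is ready
dataset = api.create_dataset(source)
api.ok(dataset)
print(dataset["resource"])      # the new dataset's ID, e.g. "dataset/..."
```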
This is what the dataset histograms look like:
Histograms are very useful. I encourage you to play with them to better understand your data. I will only note that the dataset is highly imbalanced: 4,000 clients did not subscribe to a deposit, while only 521 did (hover the mouse over the histogram of the last feature, “Subscribed deposit”).
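You can verify that imbalance outside the UI too. A quick check with pandas, assuming the label is the last column of the csv as described above:

```python
import pandas as pd

df = pd.read_csv("Bank-Marketing-Dataset.csv")
label = df.iloc[:, -1]    # the last column holds the label

print(label.value_counts())                          # roughly 4,000 "no" vs 521 "yes"
print(label.value_counts(normalize=True).round(3))   # ~0.885 vs ~0.115
```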
2. The base model
Once the dataset is uploaded, we want to create a first model. We will then evaluate it to see its performance. The quality measures from this evaluation will be our baseline for checking whether the model performs better after cleaning the dataset. That’s why I call it “the base model”.
For evaluation purposes, we need to train the model with only a portion of the data (the training dataset), and then check the accuracy of the predictions on the portion we didn’t use for training (the test dataset). For the test dataset we already know whether each client subscribed to the deposit, so we can tell whether the predictions are right or wrong.
Let’s split the data into a TRAINING and a TEST dataset. We can do it by clicking the 1-CLICK TRAINING|TEST menu option. This will randomly select 80% of the data for training and 20% for testing:
In a few seconds, the UI will take you to the training dataset (80% of the whole dataset). Note that the number of instances has been reduced to 3,616 (see image below).
We are now ready to train our model using 1-CLICK MODEL.
The 1-CLICK MODEL option will create a single tree model with default options. This is enough for our purpose. You can see my tree in the image below. I highlighted one of the patterns, the one on the right that describes one of the characteristics of the clients that didn’t subscribe to the deposit. I suggest you review the tree and play with the tools in the toolbar for filtering the patterns within a range of confidence, support, etc.:
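For reference, here is how the same split and model creation look with the Python bindings. The dataset ID is a placeholder for the one you got in section 1; the sample_rate/out_of_bag pair is BigML’s way of producing complementary 80/20 samples, and the seed makes the split reproducible:

```python
from bigml.api import BigML

api = BigML()
dataset = "dataset/<your-dataset-id>"   # placeholder: the ID from section 1

split = {"sample_rate": 0.8, "seed": "bank-tutorial"}
train = api.create_dataset(dataset, split)                        # the 80%
test = api.create_dataset(dataset, dict(split, out_of_bag=True))  # the other 20%
api.ok(train)
api.ok(test)

model = api.create_model(train)   # a single decision tree, default options
api.ok(model)
```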
3. The base evaluation
How can we check if the model is good? The first thing that might come to mind could be: what if we make predictions for the next marketing campaign, send the campaign and check against reality whether the predictions were right? That’s a good idea, but it would take a long time, and we want to know the performance before deploying the model in a production environment.
Remember that in the last section we split the data into training and test datasets? The test dataset is what we are going to use to check the performance of the model, because we already know whether each client subscribed to the deposit, so we don’t have to wait for the next campaign.
The EVALUATE menu option will do the work for us. It will make predictions using the test dataset and will automatically calculate the performance.
The UI will bring us to the “New Evaluation” screen, where it has automatically placed the training dataset on the left side, and the test dataset on the right side:
Just click Evaluate and see the results:
These are the measures we are going to take as our starting point. If you want to know what they mean and how to interpret them, hover the mouse over the measure names and you will get a short description of each one. I particularly like the Average Phi (also known as the Matthews Correlation Coefficient) because it’s a good summary of the global performance of the model.
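Scripted, the evaluation is one call with the model and the held-out dataset; the resulting JSON carries the same measures as the UI (the field path below is from my reading of the API and may vary between versions). And since Average Phi is the Matthews Correlation Coefficient, you can reproduce the statistic from raw predictions with scikit-learn:

```python
from bigml.api import BigML
from sklearn.metrics import matthews_corrcoef

api = BigML()
model = "model/<your-model-id>"    # placeholders for the IDs created above
test = "dataset/<your-test-id>"

evaluation = api.create_evaluation(model, test)
api.ok(evaluation)
# Average Phi as reported by BigML (path may vary by bindings version):
print(evaluation["object"]["result"]["model"]["average_phi"])

# The same statistic computed from true labels and predictions:
print(matthews_corrcoef(["yes", "no", "no", "yes"], ["yes", "no", "yes", "yes"]))
```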
4. Cleaning the data by removing anomalies
This is the fun part of the post. Let’s use Machine Learning to improve the performance of this Machine Learning model.
The Machine Learning technique we are going to use is called Anomaly Detection. It is an unsupervised technique, which means it doesn’t need a label; instead, it looks for structure in the data itself. In this case, it will look for anomalous rows that don’t follow any general pattern. They may be errors, defective data or just outliers that we want to clean out.
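To make the idea concrete: as far as I know, BigML’s detector is based on isolation forests, and the same spirit can be sketched in a few lines of scikit-learn. This is an illustrative stand-in, not BigML’s exact algorithm; note that the label column is excluded, since the technique is unsupervised:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("Bank-Marketing-Dataset.csv")
X = pd.get_dummies(df.iloc[:, :-1])    # one-hot encode the features; drop the label

iso = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = iso.score_samples(X)          # lower score = easier to isolate = more anomalous
print(df.iloc[scores.argsort()[:10]])  # the 10 rows the forest finds most unusual
```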
Navigate back to our original dataset (not one of the splits, but the one with 4,521 rows), and click 1-CLICK ANOMALY. This will launch the anomaly detector with the default configuration, that is, it will detect the 10 most anomalous rows in the dataset:
After a few seconds, you will see the top 10 anomalies on the left side (see image below). The algorithm has detected the 10 clients out of 4,521 that are the strongest outliers. If you hover the mouse over any of them on the left side, you will see its characteristics on the right side. I recommend that you take your time to “play” with the values in order to see why they are anomalous; it will help you understand your data better. Here are some questions you might want to answer down in the comments: How many of the anomalies did and did not subscribe to the deposit? Does this distribution match the overall subscribed/not-subscribed ratio? Why?
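The bindings version of this step is a single call as well. In my reading of the API, the finished anomaly resource embeds the top anomalies with their scores (the exact field names may vary by version):

```python
from bigml.api import BigML

api = BigML()
dataset = "dataset/<your-dataset-id>"   # placeholder: the original 4,521-row dataset

anomaly = api.create_anomaly(dataset)   # default config: the top 10 anomalies
api.ok(anomaly)
for row in anomaly["object"]["model"]["top_anomalies"]:
    print(row)    # each entry carries an anomalous row and its score
```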
5. The anomaly-free model and the final evaluation
We want to create a new model with a dataset without the detected anomalies. We will then evaluate its performance and compare it with the base-model performance.
Follow these steps to create a new dataset without the anomalies (the above image will help you):
- Click Select all
- Click the button on the left side of Create dataset. Read the tip to see what it does.
- Click Create dataset to create a new dataset without the 10 anomalies.
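If you are following along offline with the scikit-learn sketch from section 4, the equivalent of these three clicks is to drop the ten highest-scoring rows and save the result:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("Bank-Marketing-Dataset.csv")
X = pd.get_dummies(df.iloc[:, :-1])    # features only, label excluded
scores = IsolationForest(n_estimators=100, random_state=0).fit(X).score_samples(X)

clean = df.drop(df.index[scores.argsort()[:10]])   # remove the 10 most anomalous rows
clean.to_csv("Bank-Marketing-AnomalyFree.csv", index=False)
assert len(clean) == len(df) - 10                  # 4,511 instances left
```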
Great! Check that your anomaly-free dataset has 4,511 instances, that is, ten fewer than the original one. Then create a new evaluation with this new dataset by following these steps (review sections 2 and 3 of this tutorial if you need help):
- Split the anomaly-free dataset into a training and a test dataset (80/20).
- Use this new training dataset to create a single tree model with the default options.
- Create an evaluation of the new model using the test dataset created in step 1.
The following image shows the comparison for the base and the anomaly-free model:
It looks good, doesn’t it? All measures are better for the anomaly-free model. I’d like to focus on the Average Phi indicator, as it summarizes the other indicators and gives a global sense of performance. Note that it has increased by about 30%, from 0.36 to 0.47!
6. Summary
In this post, we have learned how a Machine Learning algorithm (the anomaly detector) can help us clean the data. We have also checked how the performance of the model improved just by removing 10 rows from a dataset of more than 4,500.
Data in the real world has very poor quality, and we spend a lot of time understanding it, cleaning it and transforming it into something useful for a Machine Learning algorithm. The good news is that we can take advantage of Machine Learning to reduce the time needed for these tasks or even to automate them.
A lesson learned is that Machine Learning is not only useful for predictions. It is a great technology to apply across the whole process of model creation and improvement. You may find Machine Learning algorithms useful for selecting the best predictors, for deciding when to retrain your model, or for choosing which parameters to use in an ensemble.