The importance of cleaning, selecting and transforming data

by Andrés González

When people first see a demo of a Machine Learning product, there is a general feeling that it is something magical, probably because it shows a different way for computers to work. Instead of fixed code, which behaves like a calculator that always returns the same result for identical input, a Machine Learning system works with the patterns it discovers in the data it is fed. These are dynamic programs that change over time, so the same input may not produce the same result later on.

Unlike what we might call “traditional” programming, a Machine Learning-based system adjusts to a changing environment, adapting to new situations as it is fed with fresh data.

However, it is important to understand that predictive analytics is not magic: although the algorithm learns, it can only extract valuable information from the data we provide. Algorithms do not have the intuitive capacity of humans, for better or worse, and the success of the system depends mainly on the input data.

The trick of the “demos” you can find is that the data has already been selected, cleaned and transformed so the algorithm can “easily” discover the patterns. The entire process of generating a predictive model, from data capture to prediction, is like cooking a delectable dish. The ingredients are the data, and the recipe is the algorithm: if the ingredients are in bad condition, no matter how good the recipe is, the dish will not turn out well. Likewise, if the data is not of good quality (i.e. not well selected, cleaned and transformed), even the best algorithm will give us poor predictions.


From data to algorithms

Let’s take a brief look at the data preparation process that comes first. It is common to start a predictive analysis project with a lot of data. A lot is a lot. The first task is to collect it all. The data usually lives in different repositories:

  • The company’s CRM.
  • SQL (or non-SQL) databases.
  • Spreadsheets.
  • Social networks.
  • The business billing program.
  • The email list management program.
  • Bank transaction reports.
  • Someone’s head…

Often these data are “dirty”, that is, they contain errors or discrepancies between fields in different databases. For example, the letter “ñ” or accented characters may be encoded differently depending on where we collected the data. The data cleansing phase includes, among other tasks:

  • Match formats.
  • Discard fields.
  • Correct spelling mistakes.
  • Format dates.
  • Remove duplicate columns.
  • Delete unusable records.
  • Treat missing data.
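
Some of the cleansing tasks above can be sketched in a few lines of Python. This is only an illustration using the standard library: the field names (name, signup), the list of date formats and the sample records are hypothetical, and a real project would need rules adapted to its own sources.

```python
import unicodedata
from datetime import datetime

# Hypothetical date formats found across the different repositories.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")

def normalize_text(value):
    """Trim whitespace and unify Unicode so that 'ñ' is always one code point."""
    return unicodedata.normalize("NFC", value.strip())

def parse_date(value):
    """Try each known format; return an ISO date string, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean_records(records):
    """Match formats, delete unusable records and remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        name = normalize_text(rec.get("name") or "")
        if not name:
            continue  # unusable record: no name to identify it
        signup = parse_date(rec.get("signup") or "")  # missing dates become None
        key = (name.casefold(), signup)
        if key in seen:
            continue  # same customer already collected from another source
        seen.add(key)
        cleaned.append({"name": name, "signup": signup})
    return cleaned

raw = [
    {"name": "Mu\u00f1oz", "signup": "2020-03-15"},     # precomposed ñ
    {"name": "Mun\u0303oz ", "signup": "15/03/2020"},   # n + combining tilde
    {"name": "", "signup": "2021-01-01"},               # no usable name
    {"name": "P\u00e9rez", "signup": None},             # missing date
]
cleaned = clean_records(raw)
```

Here the first two records turn out to be the same customer once the encoding and the date format are matched, so only one of them is kept. The missing date is left as an explicit None rather than guessed.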

With the “clean” data you can start to select what will be useful to make your predictions. At this stage you have to keep the “signal” and remove the fields that only provide “noise”. This selection step is usually considered part of Feature Engineering.
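
As a rough illustration of removing “noise”, the sketch below flags two kinds of fields that rarely help a model: fields with the same value in every record, and text fields with a different value in every record (typically identifiers). This is only one crude heuristic, not a complete selection method, and the field names and sample rows are hypothetical.

```python
def low_signal_fields(rows):
    """Return the fields that are constant, or text fields unique in every row."""
    flagged = []
    for field in rows[0].keys():
        values = [row[field] for row in rows]
        distinct = set(values)
        is_constant = len(distinct) == 1
        is_identifier = (
            len(distinct) == len(rows)
            and all(isinstance(v, str) for v in values)
        )
        if is_constant or is_identifier:
            flagged.append(field)
    return flagged

rows = [
    {"customer_id": "C001", "country": "ES", "monthly_spend": 20.0, "churned": False},
    {"customer_id": "C002", "country": "ES", "monthly_spend": 35.5, "churned": True},
    {"customer_id": "C003", "country": "ES", "monthly_spend": 12.0, "churned": False},
    {"customer_id": "C004", "country": "ES", "monthly_spend": 48.9, "churned": True},
]
noise = low_signal_fields(rows)
```

With these sample rows, customer_id and country are flagged, while monthly_spend and churned are kept. Deciding which of the remaining fields actually carry signal still depends on knowing the domain.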

Data transformation, which also belongs to so-called Feature Engineering, tries to generate new predictor fields based on the ones we already have. Knowledge of the domain (of the business, of the area being analyzed) is essential to tackle this phase. This phase and the selection of predictive fields are the ones that require the most intellectual and creative effort: it is not enough to know the field of study, you also need to understand in some depth how predictive algorithms work, how they interpret the data internally and how the fields relate to one another.

As an example, you might think that in a churn prediction project it is enough to have both the acquisition date and the retirement date available. We might assume that the algorithm, by analyzing these two values, is capable of “deducing” the client’s seniority. But that is not the case. The transformation in this case is very simple: add a new field that is the difference between the two dates, expressed as a number of days (or months, or years, whichever we consider better). A small modification like this can greatly improve the predictive capability of the system.
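
The seniority field described above can be sketched as follows. The function name, the field names and the sample customers are hypothetical. Using a reference date for customers who have not left yet is an assumption added for this example, since the text only covers the case where both dates exist.

```python
from datetime import date

def seniority_days(acquisition, retirement, reference):
    """Days between acquisition and retirement.

    Assumption: if the customer has not left (retirement is None),
    seniority is measured up to the given reference date instead.
    """
    end = retirement if retirement is not None else reference
    return (end - acquisition).days

customers = [
    {"acquisition": date(2019, 1, 1), "retirement": date(2020, 1, 1)},
    {"acquisition": date(2020, 1, 1), "retirement": None},
]

reference = date(2020, 3, 1)  # hypothetical date on which the data was extracted
for customer in customers:
    # Add the new predictor field next to the two original dates.
    customer["seniority_days"] = seniority_days(
        customer["acquisition"], customer["retirement"], reference
    )
```

The first customer gets a seniority of 365 days and the second, still active, gets 60 days. Dividing by 30 or 365 would give approximate months or years if that granularity suits the problem better.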


Machine Learning as a Service (MLaaS) platforms are bringing predictive analysis (as opposed to descriptive analysis) substantially closer to businesses of all sizes. What the big companies have been doing for years is now spreading to everyone else. The process we are going through is reminiscent of the evolution of databases in the 1980s and 1990s: what was initially difficult to explain (how they worked and what they were for) is now so integrated into all systems that it is difficult to find a single company that does not have a database at its core.

The algorithms are important, but they’re not the most important thing. The preliminary phase of data collection and preparation demands a substantial amount of effort and knowledge if the project is to succeed. This phase can take between 80% and 90% of the project time.

Factors such as experience, intuition, and business and customer knowledge are fundamental. You may have these skills in your company, or maybe you need to bring them in from outside 🙂

Original article (in Spanish): “La importancia de limpiar, seleccionar y transformar los datos”

Translation: Sergio Paul Ramos Moreno