“Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.”
- Data cleansing, Wikipedia
Working with unclean data can lead to ML models with poor performance, and in some scenarios the problem can even go undetected. Datasets that contain duplicates may contaminate the training data with test data (or vice versa), entries with missing values will lead models to misinterpret features, and outliers will undermine the training process, leading your model to “learn” patterns that do not exist in reality.
While data cleaning in machine learning may not seem like the most “sexy” task, if you try to avoid or skip this step you risk creating useless models and wasting your and your team’s time. Thus, we recommend developing a thorough framework for dealing with this important stage, and using tools and automation to reduce the unnecessary overhead.
Stages of Data Cleaning
The following are some of the central data cleaning steps in machine learning.

Removing duplicate entries
Duplicate entries are problematic for multiple reasons. First, when an entry appears more than once, it receives a disproportionate weight during training, so a model that succeeds on the duplicated entries will appear to perform better than it actually does. Additionally, duplicate entries can ruin the split between the train, validation and test sets when identical entries do not all land in the same set. This can lead to biased performance estimates and to disappointing models in production.
There are many possible causes of duplicate entries in databases, such as processing steps that were rerun somewhere in the data pipeline. While the existence of duplicates can hurt the learning process greatly, it is relatively easy to fix. One option is to enforce uniqueness on columns wherever applicable. Another, complementary option is to run a script that automatically detects and deletes duplicate entries. Using pandas, this can be done with the drop_duplicates method, as in the following code example:
import pandas as pd

df = pd.DataFrame({
    "FirstName": ["A", "A", "A"],
    "LastName": ["B", "B", "B"],
    "PhoneNo": [1, 1, 2],
})

# Keep only the first row for each (FirstName, LastName) pair
df.drop_duplicates(subset=["FirstName", "LastName"])
#   FirstName LastName  PhoneNo
# 0         A        B        1
Removing irrelevant data
Data often comes from multiple sources, so there is a significant probability that a given table or database includes entries that do not really belong. In some cases filtering out outdated entries will be enough; in others, a more complex filtering of the data may be necessary.
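As a minimal sketch of the simpler case, the snippet below filters out outdated entries; the `listing` and `year` columns and the 2020 cutoff are invented for illustration:

```python
import pandas as pd

# Invented example: listings with a "year" column; anything before 2020
# is considered outdated for this illustration.
df = pd.DataFrame({"listing": ["a", "b", "c"], "year": [2015, 2021, 2022]})

# Boolean indexing keeps only the rows that still belong.
recent = df[df["year"] >= 2020]
```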
Fixing structural errors
It is not uncommon to see tables, columns or values with similar names in a single database. Perhaps a data engineer slipped in an underscore or a capital letter where it wasn’t meant to be, and now your data is a mess. Merging these near-duplicate objects will go a long way toward making your data clean and ready for learning.
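As an illustration of one such merge, the hypothetical snippet below normalizes column names so that near-duplicate columns collide, then combines them row by row; the column names `Phone_No` and `phoneno` and the normalization rule are assumptions for the example:

```python
import pandas as pd

# Invented example: the same phone-number field arrived under two
# near-duplicate column names.
df = pd.DataFrame({
    "Phone_No": [1, None],
    "phoneno": [None, 2],
})

def normalize(name: str) -> str:
    # Lowercase and drop underscores/whitespace so naming variants collide.
    return name.lower().replace("_", "").strip()

df.columns = [normalize(c) for c in df.columns]

# Columns that now share a name are merged, taking the first
# non-missing value in each row.
merged = df.T.groupby(level=0).first().T
```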
Removing unwanted outliers

Outlier detection is a somewhat complex task. It requires a deeper understanding of what the data should look like, and of when entries should be ignored because they are inaccurate. Imagine a real estate dataset in which an extra digit was accidentally added to the price of a property. This kind of error is very easy to make, yet it can greatly harm the model’s ability to learn.
A first measure for detecting unwanted outliers is to explore the valid ranges and possibilities for numerical and categorical entries. For example, a negative number as the price of a car is definitely an unwanted outlier. Additionally, outlier detection or anomaly detection algorithms such as k-NN or Isolation Forest can be used to detect and remove outliers automatically.
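As a sketch of a simple rule-based check, the snippet below applies the common 1.5 × IQR heuristic to the real estate example; the price values are invented, and the quartiles are only rough sorted-list estimates:

```python
# Invented price list: the last entry has an accidental extra digit.
prices = [100_000, 120_000, 115_000, 130_000, 1_250_000]

def iqr_bounds(values, k=1.5):
    """Rough (low, high) bounds using the 1.5 * IQR rule.

    Quartiles are estimated by simple sorted-list indexing, which is
    good enough for a sanity check on small samples.
    """
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = iqr_bounds(prices)
outliers = [p for p in prices if p < low or p > high]
# outliers == [1_250_000]
```

In practice, a library implementation (e.g. a trained anomaly detector) would replace this hand-rolled rule, but the idea of flagging values far outside the bulk of the distribution is the same.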
Handling missing data
Perhaps the most important of the machine learning data cleaning steps is handling missing data. Missing values can be caused, for example, by online forms that were filled out with only the required fields, or by changes to the versions of forms and tables. In some cases, imputing missing values with the mean or the most common value is a good approach; for more crucial features, it might be worth discarding the entire entry.
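A minimal pandas sketch of both strategies; the `age` and `label` columns are invented, with `age` standing in for a non-critical feature and `label` for a crucial one:

```python
import pandas as pd

# Invented example: "age" is a non-critical feature, "label" is crucial.
df = pd.DataFrame({
    "age": [25, None, 40],
    "label": ["a", "b", None],
})

# Non-critical feature: impute missing values with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Crucial field: discard entries where it is missing entirely.
df = df.dropna(subset=["label"])
```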
Estimating data quality

Analyze a large chunk of the data after the automatic cleaning process. What percentage of the entries were problematic? Were there annotation problems, missing values or missing labels? How about duplicate entries?
Getting a good estimate of the data quality before the learning process is a critical step on the way to creating quality ML models that is often overlooked.
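One way to sketch such an estimate, assuming pandas and an invented toy table with `id` and `label` columns:

```python
import pandas as pd

# Invented toy table: one duplicated id and one missing label.
df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "label": ["cat", None, "dog", "dog"],
})

n = len(df)
# Share of entries flagged as duplicates on the id column.
pct_duplicate_ids = df.duplicated(subset=["id"]).sum() / n * 100
# Share of entries with a missing label.
pct_missing_labels = df["label"].isna().sum() / n * 100
```

Tracking these percentages over time also makes it easy to spot when an upstream pipeline change suddenly degrades data quality.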
In this post we tried to show why data cleaning is important in machine learning, along with some of the basic steps in the data cleaning process. The precise process can differ somewhat from dataset to dataset, but many of the steps are relevant in all scenarios.