
A Practical Guide to Data Cleaning

Introduction

“Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.”

Working with unclean data can lead to Machine Learning (ML) models with poor performance that can even go undetected. Datasets that contain duplicates may contaminate the training data with the test data or vice versa. Entries with missing values will lead models to misunderstand features, and outliers will undermine the training process – leading your model to “learn” patterns that do not exist in reality.

While data cleaning in Machine Learning may not seem like the most “sexy” task, skipping it risks creating useless models that waste your time. We recommend developing a thorough framework for dealing with this important stage and using tools and automation to reduce the unnecessary overhead.

Stages of Data Cleaning

Stage 1: Removing Duplicates

Duplicate entries are problematic for multiple reasons. An entry that appears more than once receives disproportionate weight during training, so models that merely succeed on frequent entries only look like they perform well. Duplicates can also ruin the split between train, validation, and test sets when identical entries end up in different sets. This leads to biased performance estimates that result in disappointing performance in production.

There are many possible causes for duplicate entries in databases, such as processing steps that were rerun anywhere in the data pipeline. While the existence of duplicates hurts the learning process greatly, it is relatively easy to fix. One option is to enforce columns to be unique whenever applicable. Another is to run a script to automatically detect and delete duplicate entries. This can be done easily with Pandas’ drop_duplicates functionality shown in this sample code:

import pandas as pd

df = pd.DataFrame({
    "FirstName": ["A", "A", "A"],
    "LastName":  ["B", "B", "B"],
    "PhoneNo":   [1, 1, 2],
})

# Keep only the first entry per (FirstName, LastName) pair.
df.drop_duplicates(subset=["FirstName", "LastName"])
#   FirstName LastName  PhoneNo
# 0         A        B        1

Stage 2: Removing Irrelevant Data

Data often comes from multiple sources and there is a significant probability that a given table or database includes entries that do not belong. In some cases, filtering outdated entries will be required. In others, a more complex filtering of the data is necessary.
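In Pandas, this kind of filtering usually reduces to boolean indexing. A minimal sketch, assuming a hypothetical sales table with `region` and `year` columns (the column names and cutoff are illustrative, not from the original):

```python
import pandas as pd

# Hypothetical table mixing current and outdated entries.
df = pd.DataFrame({
    "region": ["US", "EU", "US", "APAC"],
    "year":   [2019, 2023, 2023, 2023],
    "amount": [100, 250, 300, 125],
})

# Keep only entries relevant to the task: recent years, target regions.
relevant = df[(df["year"] >= 2022) & (df["region"].isin(["US", "EU"]))]
```

More complex filtering follows the same pattern: express “belongs in this dataset” as a boolean mask and keep only the rows that satisfy it.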

Stage 3: Fixing Structural Errors

It is not uncommon to see tables, columns, or values with similar names in a single database. Perhaps a data engineer slipped an underscore or a capital letter where it wasn’t supposed to be, and now your data is a mess. Merging these objects will go a long way in making your data clean and ready for learning.
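One common fix is to normalize names before merging. A sketch of normalizing column names with Pandas (the drifted column names below are invented for illustration):

```python
import pandas as pd

# Hypothetical table whose column names drifted in style over time.
df = pd.DataFrame({"First_Name": ["A"], "lastname": ["B"], " Phone No ": [1]})

# Normalize: strip whitespace, lowercase, drop underscores and spaces.
df.columns = (
    df.columns.str.strip()
              .str.lower()
              .str.replace("_", "", regex=False)
              .str.replace(" ", "", regex=False)
)
# Columns are now: firstname, lastname, phoneno
```

The same idea applies to values (e.g. “NY” vs “ny” vs “New York”): map each variant to one canonical form before deduplicating or merging.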

Stage 4: Detecting Outliers

Outlier detection is somewhat complex. It requires a deeper understanding of what the data should look like, and when entries should be ignored because they are inaccurate. Imagine you have a real estate dataset and an extra digit was added to the price of a property. While this kind of error is very easy to make, it can greatly and negatively affect the model’s learning ability.

The first measure in detecting unwanted outliers is to explore the valid ranges and possibilities for numerical and categorical data entries; a negative number as the price of a car, for example, is definitely an unwanted outlier. Additionally, outlier detection or anomaly detection algorithms such as KNN or Isolation Forest can be used to automatically detect and remove outliers.
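A minimal sketch combining both ideas: a rule-based range check followed by a simple statistical pass using the 1.5 × IQR rule (the prices are invented; model-based detectors like scikit-learn’s IsolationForest are a drop-in alternative to the IQR step):

```python
import pandas as pd

# Hypothetical real estate prices: one has an extra-digit typo,
# one is negative (impossible by definition).
df = pd.DataFrame({"price": [200_000, 210_000, 195_000, 2_050_000, 205_000, -5_000]})

# Rule-based range check first: prices must be positive.
df = df[df["price"] > 0]

# Statistical pass: flag values outside 1.5 * IQR of the quartiles.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = df[mask]  # the typo entry (2,050,000) is dropped
```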

Stage 5: Handling Missing Data

The most important step in ML data cleaning is handling missing data. Missing values can be caused by online forms that were filled out with only the required fields, or by changes to the versions of forms and tables. In some cases, imputing the mean or most frequent value is a good approach. For more important features, it may be better to discard the entire entry.
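A sketch of both strategies with Pandas, assuming a hypothetical table where `label` is critical (drop the row) while `age` and `city` can be imputed with the mean and mode respectively:

```python
import pandas as pd

df = pd.DataFrame({
    "age":   [25, None, 31, 40],        # numeric: impute with the mean
    "city":  ["NY", "NY", None, "LA"],  # categorical: impute with the mode
    "label": [1, 0, 1, None],           # critical: drop entries missing it
})

# Discard entries missing the critical column.
df = df.dropna(subset=["label"])

# Impute the less critical features.
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
```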

Stage 6: Quality Assurance

Analyze a large chunk of data after the automatic cleaning process.
What percentage of the entries were problematic?
Were there annotation problems, missing values or missing labels?
How about duplicate entries?
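These questions can be answered with a short report script. A sketch, assuming a hypothetical pandas DataFrame with a `label` column (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["A", "A", "B", None],
    "label": [1, 1, None, 0],
})

n = len(df)
report = {
    "duplicate_pct":     100 * df.duplicated().sum() / n,
    "missing_value_pct": 100 * df.isna().any(axis=1).sum() / n,
    "missing_label_pct": 100 * df["label"].isna().sum() / n,
}
```

Running a report like this on a sample of the cleaned data makes it easy to spot stages of the pipeline that missed something.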

Getting a good estimate of data quality before the learning process is a critical, often overlooked step on the way to creating quality ML models.

Conclusion

The precise process may differ slightly for each dataset, but most of the steps we have discussed above are relevant to all scenarios.

Further Reading

Data cleansing

What is data cleaning

How to clean data for Machine Learning

Steps to clean data

Machine Learning data cleaning techniques
