“Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.”
- Data cleansing, Wikipedia
Working with unclean data can produce Machine Learning (ML) models whose poor performance goes undetected. Datasets that contain duplicates may contaminate the training data with test data, or vice versa. Entries with missing values will lead models to misinterpret features, and outliers will undermine the training process, leading your model to "learn" patterns that do not exist in reality.
While data cleaning in Machine Learning may not seem like the most "sexy" task, skipping it risks creating useless models that waste your time. We recommend developing a thorough framework for dealing with this important stage and using tools and automation to reduce the unnecessary overhead.
Stages of Data Cleaning
Stage 1: Removing Duplicates
Duplicate entries are problematic for multiple reasons. An entry that appears more than once receives disproportionate weight during training, so models that succeed on these frequent entries only look like they perform well. Duplicates can also ruin the split between training, validation, and test sets when identical entries land in different sets. This leads to biased performance estimates and a model that disappoints in production.
There are many possible causes for duplicate entries in databases, such as processing steps that were rerun anywhere in the data pipeline. While the existence of duplicates hurts the learning process greatly, the problem is relatively easy to fix. One option is to enforce uniqueness on columns whenever applicable. Another is to run a script that automatically detects and deletes duplicate entries. This can be done easily with Pandas' drop_duplicates functionality, shown in this sample code:
df
   FirstName LastName  PhoneNo
0          A        B        1
1          A        B        1
2          A        B        2

df.drop_duplicates(subset=["FirstName", "LastName"])
   FirstName LastName  PhoneNo
0          A        B        1
Stage 2: Removing Irrelevant Data
Data often comes from multiple sources, and there is a significant probability that a given table or database includes entries that do not belong. In some cases, filtering out outdated entries will be enough. In others, a more complex filtering of the data is necessary.
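As a minimal sketch of such a filter, the snippet below keeps only entries from one region that were updated after a cutoff date. The column names and values are hypothetical, chosen for illustration only:

```python
import pandas as pd

# Hypothetical listings table; column names and values are assumptions.
df = pd.DataFrame({
    "listing_id": [1, 2, 3, 4],
    "region": ["US", "US", "EU", "US"],
    "last_updated": pd.to_datetime(
        ["2023-01-05", "2021-06-01", "2023-02-10", "2022-12-20"]
    ),
})

# Keep only entries relevant to the task: a single region,
# updated within the time window we care about.
cutoff = pd.Timestamp("2022-01-01")
relevant = df[(df["region"] == "US") & (df["last_updated"] >= cutoff)]
print(relevant["listing_id"].tolist())  # [1, 4]
```

Boolean masks like this compose well, so each relevance rule can live in its own named condition and be audited separately.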
Stage 3: Fixing Structural Errors
It is not uncommon to see tables, columns, or values with similar names in a single database. Perhaps a data engineer slipped an underscore or a capital letter where it wasn’t supposed to be, and now your data is a mess. Merging these objects will go a long way in making your data clean and ready for learning.
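One way to sketch this fix, assuming the inconsistencies are limited to casing and underscores, is to normalize column names before merging. The table contents here are hypothetical:

```python
import pandas as pd

# Two hypothetical exports of the same table with inconsistent column naming.
a = pd.DataFrame({"first_name": ["Ann"], "Phone_No": ["555-0101"]})
b = pd.DataFrame({"FirstName": ["Bob"], "phoneno": ["555-0102"]})

def normalize(col: str) -> str:
    """Lowercase and strip underscores so naming variants map to one name."""
    return col.lower().replace("_", "")

a = a.rename(columns=normalize)
b = b.rename(columns=normalize)

# After normalization the frames share a schema and can be concatenated.
merged = pd.concat([a, b], ignore_index=True)
print(list(merged.columns))  # ['firstname', 'phoneno']
```

Real structural errors can be messier than casing differences, but a deterministic normalization pass like this is a cheap first step before any manual reconciliation.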
Stage 4: Detecting Outliers
Outlier detection is somewhat complex. It requires a deeper understanding of what the data should look like, and when entries should be ignored because they are inaccurate. Imagine you have a real estate dataset and an extra digit was added to the price of a property. While this kind of error is very easy to make, it can greatly and negatively affect the model’s learning ability.
The first measure in detecting unwanted outliers is to explore the valid ranges and possible values for numerical and categorical entries; a negative number as the price of a car, for example, is definitely an unwanted outlier. Additionally, outlier- or anomaly-detection algorithms such as k-nearest neighbors (KNN) or Isolation Forest can be used to detect and remove outliers automatically.
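The range-check approach can be sketched in a few lines of pandas. The prices below are made up, including a planted negative value and an extra-digit typo, and the sanity cap is an assumption you would tune to your own data:

```python
import pandas as pd

# Hypothetical car listings; a negative price and an extra-digit
# typo are planted as the kind of outliers a range check catches.
cars = pd.DataFrame({"price": [15000, 18000, -500, 170000000, 21000]})

# Simple range check: prices must be positive and below a sanity cap.
mask = (cars["price"] > 0) & (cars["price"] < 1_000_000)
clean = cars[mask]
print(len(clean))  # 3 plausible entries remain
```

Range checks catch only the obvious cases; for subtler outliers, an algorithmic detector such as scikit-learn's Isolation Forest would be the natural next step.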
Stage 5: Handling Missing Data
The most important step in ML data cleaning is handling missing data. Missing values can be caused by online forms that were filled out with only the required fields, or by changes between versions of forms and tables. In some cases, imputing missing values with the mean or the most popular value is a good approach. For more important features, it might be worth dropping the entire entry instead.
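Both strategies can be sketched with pandas. The dataset below is hypothetical; the rule of thumb it illustrates is mean-imputation for a numeric feature, mode-imputation for a categorical one, and dropping entries whose label is missing:

```python
import pandas as pd

# Hypothetical survey data with gaps in optional fields.
df = pd.DataFrame({
    "age": [25, None, 31, 40],
    "city": ["Paris", "Paris", None, "Rome"],
    "label": [1, 0, None, 1],
})

# Numeric feature: impute with the mean of the observed values.
df["age"] = df["age"].fillna(df["age"].mean())
# Categorical feature: impute with the most frequent value.
df["city"] = df["city"].fillna(df["city"].mode()[0])
# Critical field (the label): drop the entry entirely.
df = df.dropna(subset=["label"])
print(len(df))  # 3 labeled rows remain
```

Which columns deserve imputation versus dropping is a judgment call that depends on how central the feature is to the model.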
Stage 6: QA
Analyze a large chunk of data after the automatic cleaning process:
What percentage of the entries were problematic?
Were there annotation problems, missing values, or missing labels?
How about duplicate entries?
Getting a good estimate of data quality before the learning process is a critical, often overlooked step on the way to creating quality ML models.
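A QA pass like the one above can be sketched as a small audit script. The raw table here is hypothetical; the point is to report the share of entries each check flags:

```python
import pandas as pd

# Hypothetical raw table to audit after automatic cleaning.
raw = pd.DataFrame({
    "name": ["A", "A", "B", None, "C"],
    "value": [1.0, 1.0, None, 3.0, -99.0],
})

n = len(raw)
n_duplicates = int(raw.duplicated().sum())          # exact duplicate rows
n_missing = int(raw.isna().any(axis=1).sum())       # rows with any missing field

print(f"{n_duplicates / n:.0%} duplicates, {n_missing / n:.0%} with missing values")
# 20% duplicates, 40% with missing values
```

Running the same audit before and after cleaning gives a concrete measure of what the pipeline actually fixed.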