
Data Validation Testing Checklist

This blog post was written by Tonye Harry as part of the Deepchecks Community Blog. If you would like to contribute your own blog post, feel free to reach out to us via blog@deepchecks.com. We typically pay a symbolic fee for content that's accepted by our reviewers.

Introduction

Between 80% and 90% of the world’s data is unstructured (sensor, text, video, audio, etc.), and it can be messy, making it difficult to use this type of data in conventional ways when developing ML models for real-world solutions. It is common knowledge in Machine Learning that your model is only as good as the data you feed it. This means that when the source data is not representative of the use case or is faulty, your model’s performance becomes sub-optimal.

To fix or avoid this issue, validating your data comes in handy. Data validation for Machine Learning systems ensures that your source data meets the criteria or expectations of the ML use case. It ensures not just quality and accuracy but goes a step further to make sure that the data is relevant.

Since this is such an important topic for researchers and companies alike, this article provides a checklist of data validation tests you should consider if you want to reduce the probability of having faulty, unrepresentative data and increase your chances of having optimal model performance.

Let’s dive in!

Checklist For Data Validation

Depending on the complexity of the ML project you are working on, it can get difficult to monitor every stage of the ML lifecycle. This difficulty is more pronounced when there are many data sources and a jungle of data pipelines to manage. The time and cost of doing this can take a toll on the organization, so in cases like this it is advisable to be proactive at every stage of the project’s lifecycle.

To reduce the chances of running into challenges at each stage of the ML lifecycle, it is recommended that you focus on testing in each stage of your project, from data collection to deployment. Data validation tests should be done at both the pre-training and post-training stages of the ML project cycle. They help you catch errors at every stage of the project.

Pre-training stage

Issues that arise here come either from a lack of initial research on the domain area to understand the nuances of the data, or from problems with the source of the data and how it is prepared and audited during model development. These problems come in the form of the following:

  • Insufficient quantity or non-representative training data.
  • Changes in the structure of the data or its attributes (features), for example, changes between training and production data schemas, data types and formats, distributions, dimensions, outliers, etc. for the ML use case. An imbalanced dataset or labeling errors can also fit into this category.

These issues mostly happen in data pipelines (ETL, ELT, or reverse ETL) or during the migration of data from one platform to another. Some can even happen after all your research and planning. Here are a few ways you can check for them:

  • Set up expectations for each dataset ingested into the system with ML tools like Deepchecks or Great Expectations to ensure that you can easily flag data integrity issues like duplicate values, null values, data type mismatches, etc.

Checking for Data Integrity using deepchecks

Note: This code example uses the Python programming language.

Installing Deepchecks and loading data

# If you don't have deepchecks installed yet, run:
import sys
!{sys.executable} -m pip install deepchecks -U --quiet

# or install using pip from your python environment

from deepchecks.tabular import datasets

# load data
data = datasets.regression.avocado.load_data(data_format='DataFrame', as_train_test=False)

Defining a Dataset Object

from deepchecks.tabular import Dataset

# Categorical features can be heuristically inferred, however we
# recommend to state them explicitly to avoid misclassification.

# Metadata attributes are optional. Some checks will run only if specific attributes are declared.

ds = Dataset(data, cat_features=['type'], datetime_name='Date', label='AveragePrice')

Running deepchecks full tabular data test suite

from deepchecks.tabular.suites import data_integrity

# Run Suite:
integ_suite = data_integrity()
suite_result = integ_suite.run(ds)
# Note: the result can be saved as html using suite_result.save_as_html()
# or exported to json using suite_result.to_json()
suite_result.show()

This produces an output with tests that either “Didn’t Pass”, “Passed”, or “Didn’t Run”. It exhaustively performs a data integrity check on your tabular data with a few lines of code. The results are shown in this quickstart guide for data integrity checks.
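If you only need one specific check rather than the full suite, individual checks can be run on the same Dataset object. Below is a minimal sketch, assuming the ds object defined above and a deepchecks version that ships the DataDuplicates check; other checks follow the same pattern.

from deepchecks.tabular.checks import DataDuplicates

# Run a single data integrity check on the Dataset defined above
check_result = DataDuplicates().run(ds)
check_result.show()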

  • For unstructured data, e.g., images or text, validate the data using ML tools specific to computer vision or Natural Language Processing (NLP) use cases. Check for unbalanced data, especially in the target variables, so that you can supplement it with more training data if necessary (see the sketch after this list).
  • Validate your train-test split by checking for data leakage and feature or label distribution drift. Also compare the integrity and distributions of data batches entering the target system.
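For the label-imbalance point above, here is a minimal sketch using pandas; the DataFrame and its 'label' column are toy placeholders for your own classification data, and the 10% threshold is an arbitrary choice.

import pandas as pd

# Toy stand-in; replace with your own DataFrame and target column
df = pd.DataFrame({'label': ['ok'] * 95 + ['default'] * 5})

# Inspect the class distribution of the target column
label_share = df['label'].value_counts(normalize=True)
print(label_share)

# Flag the dataset as imbalanced if the rarest class falls below the chosen threshold
if label_share.min() < 0.10:
    print("Warning: target classes are imbalanced; consider more data or resampling.")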

Using deepchecks Train-Test Validation Suite

# Loading data
from deepchecks.tabular.datasets.classification import lending_club
import pandas as pd

data = lending_club.load_data(data_format='Dataframe', as_train_test=False)
data.head(2)

## Splitting Data to Train and Test

# convert the date column `issue_d` to datetime
data['issue_d'] = pd.to_datetime(data['issue_d'])

# Use data from June and July for train and August for test:
train_df = data[data['issue_d'].dt.month.isin([6, 7])]
test_df = data[data['issue_d'].dt.month.isin([8])]
## Defining Metadata

categorical_features = ['addr_state', 'application_type', 'home_ownership', \
  'initial_list_status', 'purpose', 'term', 'verification_status', 'sub_grade']
index_name = 'id'
label = 'loan_status' # 0 is DEFAULT, 1 is OK
datetime_name = 'issue_d'

## Creating Dataset Object

from deepchecks.tabular import Dataset

# Categorical features can be heuristically inferred, however we
# recommend to state them explicitly to avoid misclassification.

# Metadata attributes are optional. Some checks will run only if specific attributes are declared.

train_ds = Dataset(train_df, label=label,cat_features=categorical_features, \
                   index_name=index_name, datetime_name=datetime_name)
test_ds = Dataset(test_df, label=label,cat_features=categorical_features, \
                   index_name=index_name, datetime_name=datetime_name)

# for convenience, let's save these in a dictionary so we can reuse them for future Dataset initializations
columns_metadata = {'cat_features' : categorical_features, 'index_name': index_name,
                    'label':label, 'datetime_name':datetime_name}
## Running the deepchecks suite

from deepchecks.tabular.suites import train_test_validation

validation_suite = train_test_validation()
suite_result = validation_suite.run(train_ds, test_ds)
# Note: the result can be saved as html using suite_result.save_as_html()
# or exported to json using suite_result.to_json()
suite_result

More details on the above code can be found in deepchecks’ quickstart – Train-Test Validation Suite guide.

  • Validate metadata (information about the data) to ensure that information like data types, lengths, column names, etc. stays consistent, and check whether the metadata changes between the test environment and the production environment (see the sketch after this list).
  • Check that data transformations like joins or data splits are done correctly after each ETL job. This can also be done by utilizing checks for duplicates, null values, mismatches, etc.
  • Perform regression tests on your database for incoming data. This involves running all the test suites mentioned above whenever the data changes. When this is automated, it becomes very easy to maintain.
  • Check for conflicting labels, which are characterized by identical samples having different labels (a sketch follows this list). Ensure splitting is done correctly to reduce the chances of data leakage, and set alerts with your ML tool to detect sampling errors.
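For the metadata-consistency and conflicting-label points above, here is a minimal sketch. The two DataFrames are toy stand-ins for the same table in two environments, and it assumes the train_ds Dataset defined earlier plus a deepchecks version that includes the ConflictingLabels data-integrity check.

import pandas as pd
from deepchecks.tabular.checks import ConflictingLabels

# Toy stand-ins for the same table in a test and a production environment
test_env_df = pd.DataFrame({'id': [1, 2], 'amount': [10.5, 20.0]})
prod_env_df = pd.DataFrame({'id': [1, 2], 'amount': ['10.5', '20.0']})  # dtype drifted to string

# Compare column dtypes between environments; any mismatch signals schema drift
mismatched = test_env_df.dtypes[test_env_df.dtypes != prod_env_df.dtypes]
print(mismatched if not mismatched.empty else "Schemas match")

# Flag identical samples that carry different labels on the train Dataset defined earlier
ConflictingLabels().run(train_ds).show()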

Post-Training Stage

Testing Machine Learning models is implicitly linked to data validation because of the direct relationship between data and the performance of your model. At this point, it is assumed that you have completed all the checks above and are ready to test the model with your dataset. Comparing model performance is based on:

  • The metrics agreed by stakeholders
  • General behavior of your model

Metrics for performance vary across use cases (e.g., classification and regression), but they generally evaluate the skill of the model on a given dataset in a training or production environment. If the model’s performance is optimal, the percentage of correct predictions should be high, which may mean that the preprocessed input data is appropriate for the use case and meets the criteria for the model to predict correctly. On the other hand, ML models can produce wrong outputs without raising errors, creating a blind spot for data science teams.
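As a quick illustration of such metrics, here is a minimal sketch with scikit-learn; the label arrays are toy placeholders rather than real model outputs.

from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: fraction of correct predictions
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 1, 0, 0]
print("Accuracy:", accuracy_score(y_true_cls, y_pred_cls))

# Regression: mean squared error between predictions and actual values
y_true_reg = [1.2, 0.9, 1.5]
y_pred_reg = [1.0, 1.1, 1.4]
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))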

The following are things you can test or check to confirm that the model generalizes properly with the input dataset:

  • Check the performance on the train and test datasets. Take note of weak data segments (e.g., demographics, location) and correlations. After making corrections as you see fit, test the model again to see if there is any improvement.
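For tabular models, a comparable check can be run with deepchecks’ model evaluation suite. The sketch below is kept self-contained: the iris dataset and a RandomForestClassifier are placeholders for your own data and model, and it assumes your deepchecks version exposes the model_evaluation suite.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import model_evaluation

# Toy stand-ins; replace with your own data, Dataset objects, and fitted model
iris_df = load_iris(as_frame=True).frame
iris_train, iris_test = train_test_split(iris_df, test_size=0.3, random_state=0)
iris_train_ds = Dataset(iris_train, label='target', cat_features=[])
iris_test_ds = Dataset(iris_test, label='target', cat_features=[])

model = RandomForestClassifier(random_state=0)
model.fit(iris_train.drop(columns='target'), iris_train['target'])

# Compare train vs. test performance with deepchecks' model evaluation suite
eval_result = model_evaluation().run(iris_train_ds, iris_test_ds, model)
eval_result.show()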

Computer Vision (CV) Validation Test

This is part of the object detection validation test tutorial on the deepchecks documentation page showing how to run a deepchecks full suite check on a CV model and its data.

from deepchecks.vision.suites import full_suite

suite = full_suite()
result = suite.run(training_data, test_data, model, device=device)

result.save_as_html('output.html')

result

Fig. 1: Reports of all the checks done on a dataset and model for an object detection use case. Source: Deepchecks

  • Test the model using slice-based learning to better understand it by taking a closer look at samples of sub-data groups and classes to see how the model performs on each sliced sample.
  • Run an error analysis to check for bias and variance in the data that might be affecting the model’s performance. This can lead you to increase model robustness by adding noise or variations to the dataset to infer how the model would perform if new data were added to the ML system. For example, you can augment your computer vision dataset or add adversarial samples to your NLP dataset.
  • Consider testing the behavior of your model by utilizing an Invariance Test (INV), Minimum Functionality Test (MFT), smoke test, or Directional Expectation Test (DET).
  • Monitor and test for data drift utilizing the Kolmogorov-Smirnov and chi-squared tests, as sketched below. These tests have varying sensitivity, and they enable teams to get alerts when their data shifts.
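Here is a minimal sketch of such drift tests using SciPy; the reference and production samples are simulated toy data, and the significance threshold you alert on is up to your team.

import numpy as np
from scipy.stats import ks_2samp, chi2_contingency

rng = np.random.default_rng(0)

# Numeric feature: Kolmogorov-Smirnov test between a reference sample and a production batch
reference = rng.normal(loc=0.0, scale=1.0, size=1000)
production = rng.normal(loc=0.3, scale=1.0, size=1000)  # simulated shift
ks_stat, ks_p = ks_2samp(reference, production)
print(f"KS statistic={ks_stat:.3f}, p-value={ks_p:.4f}")

# Categorical feature: chi-squared test on category counts from both samples
reference_counts = [500, 300, 200]
production_counts = [420, 360, 220]
chi2, chi_p, _, _ = chi2_contingency([reference_counts, production_counts])
print(f"Chi-squared={chi2:.3f}, p-value={chi_p:.4f}")

# A small p-value suggests the feature's distribution has drifted and should trigger an alert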

In addition to all the things you should check, ensure that all your data validation testing techniques and steps are well documented so that they can be reproduced in the future, inevitably reducing the burden of maintenance and errors for scalable ML solutions. Research all the validation checks and tools that can easily be integrated into your workflow, then choose the most important ones, guided by the key performance indicators (KPIs) and metrics picked by stakeholders.

Depending on the types of data validation or ML model testing your project requires, it is advisable to use MLOps tools to monitor aspects of your project, increase transparency (explainability), promote collaboration, and automate important tests to save both time and cost. Deepchecks does a great job at data validation and model testing, especially for tabular and computer vision data, along with a variety of ML models. Try it out and remember to leave a star!
