
Machine Learning Models Are Only as Good as the Data They Are Trained On

This blog post was written by Inderjit Singh Chahal as part of the Deepchecks Community Blog. If you would like to contribute your own blog post, feel free to reach out to us via blog@deepchecks.com. We typically pay a symbolic fee for content that’s accepted by our reviewers.

The quality of a machine learning model is determined largely by the quality of the dataset it was trained on. Perhaps that is why a recent study, which compiled responses from a number of machine learning practitioners, found that accessing, preparing, and validating data are among the most time-consuming components of a machine learning project. The study reports the following distribution of time spent across the components of a machine learning production cycle:

Source: Citeseerx paper

To further underline the importance of data, the same study found that a majority of respondents consider the data collection, preparation, and validation steps to be the most critical components of their projects. The following figure summarizes the responses by the criticality of each modeling step.


Different Data Validation Techniques and Tools

Now that we have established the importance of data validation for machine learning, let us discuss the various tools and techniques available for it. These techniques generally fall into two broad categories:

  • Proactive data validation
  • Reactive data validation

As the names suggest, for most practical implementations we prefer proactive data validation, because it catches issues in the earlier modeling steps and therefore saves a significant amount of time. It is almost always the most critical part of the data validation process, because it concerns the gold-standard dataset that forms the bedrock for all downstream decisions, such as:

  • Feature selection
  • Feature importance calculation
  • Model selection
  • Model validation and benchmarking

Proactive Data Validation Tools and Techniques

Proactive data validation tools and techniques are further classified into:

– Type safety means placing a middleware layer that validates data types and other constraints at the source, integrated with the main annotation tool, to prevent errors in the remainder of the downstream tasks.

The tool used for resolution

A small snippet that implements a type-safety constraint on any function serving as middleware between the annotation tool and the data warehouse:

import functools
import inspect

def wrapper(func):
    """Validate argument types against the function's annotations
    before the call proceeds."""
    sig = inspect.signature(func)

    @functools.wraps(func)
    def inner(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = func.__annotations__.get(name)
            if expected is not None and not isinstance(value, expected):
                raise TypeError(
                    f"expected dtype {expected} but got {type(value)}"
                )
        return func(*args, **kwargs)

    return inner

@wrapper
def middleware_func(foo: int, bar: str) -> str:
    return "out"

– Schema validation means validating the annotated data on the storage side, where we expect particular data types (e.g. the coordinates of bounding boxes should be integers) and can enforce them.

import schemathesis

schema = schemathesis.from_uri("http://example.com/swagger.json")

@schema.parametrize()
def test_api(case):
    # Call the API with the generated case and validate the response
    case.call_and_validate()
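For annotation records stored outside an API, the same idea can be enforced with a hand-rolled schema check. A minimal sketch; the field names and expected types below are illustrative, not taken from any specific annotation tool:

```python
# Minimal schema check for bounding-box annotation records.
# Field names and expected types are illustrative assumptions.
BBOX_SCHEMA = {"x": int, "y": int, "width": int, "height": int}

def validate_bbox(record: dict) -> None:
    """Raise ValueError if a bounding-box record violates the schema."""
    for field, expected in BBOX_SCHEMA.items():
        value = record.get(field)
        if not isinstance(value, expected):
            raise ValueError(
                f"field {field!r} must be {expected.__name__}, "
                f"got {type(value).__name__}"
            )

validate_bbox({"x": 10, "y": 20, "width": 100, "height": 50})  # passes
```

A record with string coordinates (e.g. `"x": "10"`) would be rejected before it ever reaches the data warehouse.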

– Label ambiguity search is a technique that looks for identical samples with different labels. Such ambiguity is usually caused by mislabeled data, or by data collection that left out features which would distinguish the samples.

The tool used for resolution
A sample snippet that uses the deepchecks library as an automated data validation tool, searching for label ambiguity in a data verification pipeline:

from deepchecks.checks.integrity import LabelAmbiguity
from deepchecks.base import Dataset
import pandas as pd

# phishing_dataset is a deepchecks Dataset wrapping a labeled dataframe
check = LabelAmbiguity()
result = check.run(phishing_dataset)
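Conceptually, an ambiguity search boils down to grouping samples by their feature values and flagging any group that carries more than one distinct label. A minimal pure-Python sketch of that idea:

```python
from collections import defaultdict

def find_ambiguous_samples(samples):
    """Return feature tuples that appear with more than one distinct label.

    `samples` is an iterable of (features_tuple, label) pairs.
    """
    labels_seen = defaultdict(set)
    for features, label in samples:
        labels_seen[features].add(label)
    # Keep only feature combinations observed with conflicting labels
    return {f: labs for f, labs in labels_seen.items() if len(labs) > 1}

data = [((1.0, "a"), 0), ((1.0, "a"), 1), ((2.0, "b"), 0)]
print(find_ambiguous_samples(data))  # {(1.0, 'a'): {0, 1}}
```

Any non-empty result points to samples that either were mislabeled or need additional distinguishing features.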

  • A/B testing in data validation for machine learning means running controlled experiments that validate the analytics flow against a held-out gold-standard dataset, one we know with certainty is accurate and representative of the production data.
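One way to frame such a controlled experiment is as a promotion gate: the candidate pipeline must match or beat the incumbent on the gold-standard set before it ships. A hedged sketch; the metric (accuracy) and the lift threshold are illustrative choices:

```python
def passes_ab_gate(candidate_preds, incumbent_preds, gold_labels, min_lift=0.0):
    """Promote the candidate only if its accuracy on the gold-standard
    holdout beats the incumbent's by at least `min_lift`."""
    def accuracy(preds):
        return sum(p == y for p, y in zip(preds, gold_labels)) / len(gold_labels)

    return accuracy(candidate_preds) >= accuracy(incumbent_preds) + min_lift

gold = [1, 0, 1, 1]
# Candidate scores 4/4 vs the incumbent's 3/4, so it passes the gate
print(passes_ab_gate([1, 0, 1, 1], [1, 0, 0, 1], gold))  # True
```

In practice the metric would be whatever business measure the experiment is designed around, but the gate structure is the same.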

Reactive Data Validation Techniques and Tools

  • The freshness testing technique is used to determine how up-to-date the data sources are for a model's retraining pipelines.

Tools to Remedy

We can use tools like dbt or TensorFlow Data Validation to validate the health and relevance of the data against preset conditions, ensuring that data quality validation is thoroughly completed and that we are using the most relevant data for production training and fine-tuning pipelines.
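At its core, a freshness check can be as simple as comparing the newest record's timestamp against a maximum allowed age before triggering retraining. A minimal sketch; the seven-day window is an assumption, not a recommendation:

```python
from datetime import datetime, timedelta

def is_fresh(latest_timestamp, now=None, max_age=timedelta(days=7)):
    """Return True if the newest record is within the allowed age window.

    `max_age` is an illustrative threshold; real pipelines would tune it
    per data source.
    """
    now = now or datetime.utcnow()
    return now - latest_timestamp <= max_age

now = datetime(2022, 1, 10)
print(is_fresh(datetime(2022, 1, 8), now=now))   # True: 2 days old
print(is_fresh(datetime(2021, 12, 1), now=now))  # False: ~40 days old
```

Tools like dbt express the same idea declaratively, as source freshness tests, rather than in imperative code.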

– Distribution/drift checks are the technique through which we keep track of changes in the distribution of incoming data (test or inference data) that might affect the model's ability to make relevant predictions.

The tool used for resolution

Deepchecks provides a very simple way of keeping track of these shifts. The small code snippet below ensures data quality validation and keeps track of any changes in the data distribution:

# train_dataset and test_dataset are deepchecks Dataset objects
check = TrainTestDrift()
result = check.run(train_dataset=train_dataset, test_dataset=test_dataset, model=model)
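Under the hood, a drift check compares the distribution of each feature between the train and test sets. A minimal pure-Python sketch of one common statistic for this, the Population Stability Index; the bucket count and the 1e-6 floor are illustrative choices:

```python
import math

def psi(expected, actual, buckets=4):
    """Population Stability Index between two samples of a numeric feature.

    Buckets are equal-width over the range of `expected`; values of
    `actual` outside that range are ignored in this sketch.
    """
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / buckets for i in range(buckets + 1)]

    def frac(sample, i):
        count = sum(edges[i] <= x < edges[i + 1] for x in sample)
        if i == buckets - 1:  # include the right endpoint in the last bucket
            count += sum(x == hi for x in sample)
        return max(count / len(sample), 1e-6)  # floor to avoid log(0)

    return sum(
        (frac(actual, i) - frac(expected, i))
        * math.log(frac(actual, i) / frac(expected, i))
        for i in range(buckets)
    )
```

Identical distributions give a PSI of zero; the larger the value, the stronger the drift signal.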


Source: Deepchecks

– Feedback loop skew monitoring is used to watch for skew that arises from the way prediction results are presented to the user. For example, if a recommender system scores 100 videos for a user but only the top 10 results are shown, the remaining 90 never receive any attention and hence never become part of the feedback loop.

The tool used for resolution

As the definition makes clear, this is more of a data pipeline issue, which is why we should visualize the pipelines (using Graphviz, for example) to verify that adequate checks are in place and that this skew does not creep into the production pipeline. Below is a simple graph generated for such a pipeline using Graphviz.
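One way to produce such a graph is to emit Graphviz DOT source directly, which the `dot` tool can then render; the stage names below are illustrative:

```python
def pipeline_dot(edges):
    """Build Graphviz DOT source for a directed pipeline graph."""
    lines = ["digraph pipeline {", "    rankdir=LR;"]
    for src, dst in edges:
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

stages = [
    ("Score 100 candidates", "Show top 10"),
    ("Show top 10", "User clicks"),
    ("User clicks", "Feedback logs"),
    # Note: the 90 unshown candidates never reach the feedback edge below
    ("Feedback logs", "Retraining data"),
]
print(pipeline_dot(stages))
```

Seeing the pipeline laid out this way makes it obvious where the unshown candidates drop out of the loop, and therefore where a correction (e.g. logging scores for unshown items) belongs.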


In this post we established the importance of data validation in machine learning and looked at the tools available for the various data validation techniques and procedures. A data validation tool can be a sophisticated UI/UX or a small code snippet that keeps track of changes in the model's environment. These tools not only make life easier by automating most of the mundane tasks associated with the data validation process, but are also critical to maintaining machine learning pipelines in production.
