Training vs. Validation vs. Test Sets

Perhaps one of the first concepts newcomers learn about in the field of Machine Learning (ML) is the division of data into training, validation, and test sets. While every ML practitioner is familiar with these concepts, there are common misconceptions that may end up hurting the processes of validating and testing your Machine Learning models. These sets are generally defined as:

“Training set: A set of examples used for learning, that is to fit the parameters of the classifier.”

“Validation set: A set of examples used to tune the parameters of a classifier, for example to choose the number of hidden units in a neural network.”

“Test set: A set of examples used only to assess the performance of a fully-specified classifier.”

Motivation

The motivation for splitting the data into different sets is to detect memorization and overfitting. Let’s say we want to test whether a primary school student understands the Fibonacci sequence. If we reveal that the sequence starts with 1, 1, 2, 3, 5, 8, 13… and then ask what the 5th number of the sequence is, we are not measuring the student’s understanding of the progression, but rather their memory. If the student can apply the rule and tell us the numbers that follow in the sequence, they will have proved their understanding. Similarly, a model can memorize (overfit) the training data: its performance on the training set keeps improving while its performance on new data is in fact deteriorating. Evaluating on an unseen test set ensures that the measured success is not due to overfitting or memorization, and gives us an estimate of the expected performance on the actual data the model will see in production. Estimating and detecting such overfitting is an essential aspect of model validation in Machine Learning.
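
The gap between training performance and performance on unseen data can be seen in a small experiment. The sketch below is only an illustration and is not taken from the article: the synthetic dataset from make_classification, the choice of DecisionTreeClassifier, and every parameter value are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed synthetic data. flip_y adds label noise, so a model that simply
# memorizes the training labels cannot generalize perfectly.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A tree with no depth limit keeps splitting until its leaves are pure,
# which lets it memorize the training samples, noise included.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The training accuracy alone would suggest a near-perfect model; only the score on the held-out test set reveals how much of that is memorization.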

What’s the validation set for?

We now know the importance of keeping a held-out test set to get an unbiased measurement of our model’s performance, but how do we choose the hyperparameters for our model? How do we compare different models’ performance? Which dataset should be used? Evaluating on the training set is not informative enough, since it rewards memorization. If we prefer one model over another based on its performance on the test set, we risk contaminating the test set and overfitting to it. These processes should be done using the validation dataset. Error analysis should be done on the validation set for the same reason. Many practitioners do not take the time to properly understand this, and they risk producing a biased estimate of the model’s performance on real-world data.
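
As a rough sketch of this workflow, the example below picks a hyperparameter using only the validation set and touches the test set once, at the very end. The synthetic data, the DecisionTreeClassifier, and the list of candidate depths are assumptions chosen for illustration, not a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed synthetic data standing in for a real dataset.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

# 60/20/20 split: hold out the test set first, then split off validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

# Compare candidate hyperparameter values on the validation set only.
candidate_depths = [1, 2, 4, 8, 16, None]  # hypothetical search grid
val_scores = {}
for depth in candidate_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    val_scores[depth] = model.score(X_val, y_val)

best_depth = max(val_scores, key=val_scores.get)

# The test set is used exactly once, after all choices have been made.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)
print(f"best max_depth: {best_depth}, test accuracy: {test_acc:.2f}")
```

Because the test score played no part in choosing max_depth, it remains an honest estimate for the selected model.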

Determining the Sizes of the Sets

When it comes to determining the sizes of the sets, there is a tradeoff at play. On the one hand, more training data generally means a more accurate trained model. On the other, a small amount of test/validation data gives us a noisier, less reliable estimate of the model’s performance. This means there is a need to find a balance. For large enough datasets, a 60-20-20 split or even an 80-10-10 split are widely used choices. For small datasets, the test set may not be representative enough, so practices such as k-fold cross-validation should be used.

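from sklearn.model_selection import train_test_split
# X and y are the feature matrix and labels, assumed to be loaded already.
# First hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Then carve the validation set out of the remainder: 25% of 80% is 20% of the total.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)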

Code snippet for splitting data into training, validation, and test sets (source)
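
For small datasets, k-fold cross-validation can be sketched as follows. This is a minimal example under stated assumptions: the synthetic data, the LogisticRegression model, and the choice of 5 folds are all illustrative rather than prescribed by the article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumed small synthetic dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Each of the 5 folds serves once as the evaluation fold while the model
# is trained on the remaining 4 folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("accuracy per fold:", scores.round(2))
print(f"mean: {scores.mean():.2f}, std: {scores.std():.2f}")
```

Averaging over folds makes the estimate less dependent on one particular split. If the cross-validation scores are used to choose between models or hyperparameters, a separate held-out test set is still needed for the final evaluation.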

Data Leakage

When splitting the data, it is important to verify that there is no overlap between the sets. The first measure to take is to remove duplicate entries in your data. In datasets where a single object (e.g., a user) can have multiple data samples (e.g., individual purchases), the best practice is to keep all of these samples in only one of the datasets. In other words, datasets should be split according to entities and not just by samples. A less trivial type of leakage occurs when a feature carries information about the label that will not be available at prediction time. For example, the index of a sample may contain information about the label because of the order in which the entries were added to the database. During inference, however, this feature won’t help with the prediction and may not even exist.

If the data splitting is not done correctly, we risk getting a biased performance analysis that does not reflect the results we will see in the real world. Ensuring there is no leakage is an important part of the data validation process.
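
The two basic precautions, removing duplicates and splitting by entity, can be sketched with scikit-learn’s GroupShuffleSplit. The purchase log below is hypothetical data invented for this example, as are its column layout and the split size.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical purchase log: columns are (user_id, amount, label).
rows = np.array([
    [1, 10, 0], [1, 10, 0],              # exact duplicate
    [2, 25, 1], [2, 30, 1], [2, 5, 0],
    [3, 12, 0],
    [4, 8, 1], [4, 40, 1],
    [5, 15, 0], [5, 15, 0],              # exact duplicate
])

# Step 1: remove duplicate entries (np.unique also sorts the rows).
rows = np.unique(rows, axis=0)
user_ids, X, y = rows[:, 0], rows[:, 1:2], rows[:, 2]

# Step 2: split by entity, so all samples of a user land in the same set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=user_ids))

# Sanity check: no user should appear on both sides of the split.
shared_users = set(user_ids[train_idx]) & set(user_ids[test_idx])
print("users in both sets:", shared_users)
```

Here test_size refers to the fraction of groups (users), not of individual samples, so the resulting sample counts can deviate from the nominal ratio when users have very different numbers of purchases.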

Splitting the Dataset for Time Series Data

Time series data poses a challenge because the data is not divided into well-defined standalone samples. For each point in time, data from the past can be used as features, while data from the future can’t. If we arbitrarily separate samples into train and test, we may leak future information into training and lose important contextual data for training and for making predictions.

A common practice is to use earlier data for training, and then evaluate on a test set that consists of the later samples. scikit-learn’s TimeSeriesSplit provides functionality for splitting the data this way.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=1)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Output:
TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
TRAIN: [0 1 2 3] TEST: [4]
TRAIN: [0 1 2 3 4] TEST: [5]

Code snippet for splitting time series data (source)

In a nutshell, correctly applying the concepts and methodology of splitting the data into train, validation, and test sets is essential to getting an accurate estimate of your model’s performance on unseen data. The common pitfalls to look out for are data leakage, overfitting to the test set, and incorrect handling of datasets with more complex structures such as time series data.
