Introduction
Perhaps one of the first concepts newcomers learn in the field of Machine Learning (ML) is the division of data into training, validation, and test sets. While every ML practitioner is familiar with these sets, common misconceptions about them can undermine how you validate and test your Machine Learning models. They are generally defined as:
“Training set: A set of examples used for learning, that is to fit the parameters of the classifier.”
“Validation set: A set of examples used to tune the parameters of a classifier, for example to choose the number of hidden units in a neural network.”
“Test set: A set of examples used only to assess the performance of a fully-specified classifier.”
- Brian Ripley, page 354, Pattern Recognition and Neural Networks, 1996
Motivation
The incentive for splitting the data into different sets is to avoid memorization and overfitting. Let’s say we want to test whether a primary-school student understands the Fibonacci sequence. If we reveal that the sequence starts with 1, 1, 2, 3, 5, 8, 13… and then ask what the 5th number of the sequence is, we are not measuring the student’s understanding of the progression, but rather their memory. If the student can apply the rule and tell us the numbers that follow, they will have demonstrated their understanding. Similarly, a model can over-memorize (overfit) the training data, so its performance on the training set keeps improving while its performance on unseen data deteriorates. Using an unseen test set ensures that the measured success is not due to overfitting or memorization, and it gives us the expected performance on the actual data the model will see in production. Detecting such overfitting is an essential aspect of model validation in Machine Learning.
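As a rough sketch of how a held-out set exposes memorization, the snippet below (which assumes scikit-learn, a synthetic toy dataset, and an unconstrained decision tree chosen only for illustration) fits a model that can memorize its training data and compares its scores on seen versus unseen samples:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data generated only for this illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# An unconstrained tree can memorize the training set almost perfectly
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # typically close to 1.0
print("test accuracy:", model.score(X_test, y_test))     # typically noticeably lower

The gap between the two scores is exactly what a held-out test set is meant to reveal.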
What’s The Validation Set For?
We now know the importance of keeping a held-out test set to get an unbiased measurement of our model’s performance, but how do we choose the hyperparameters for our model? How do we compare different models’ performance? Which dataset should be used? Evaluating on the training set is not informative enough, and if we prefer one model over another based on its performance on the test set, we risk contaminating the test set and overfitting to it. These decisions should therefore be made using the validation set. Error analysis should be done on the validation set for the same reason. Many practitioners do not take the time to properly internalize this, and the result is a biased estimate of the model’s performance on real-world data.
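A minimal sketch of this workflow, assuming scikit-learn, a synthetic dataset, and an arbitrary hyperparameter grid picked purely for illustration: candidate models are compared on the validation set, and the test set is touched only once, at the very end.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
# 60-20-20 split: train for fitting, validation for model selection, test held out
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1, 10]:  # candidate hyperparameter values (arbitrary grid)
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)  # candidates are compared on the validation set only
    if score > best_score:
        best_C, best_score = C, score

final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("test accuracy:", final_model.score(X_test, y_test))  # evaluated once, at the end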
Determining The Sizes Of The Sets
When it comes to determining the sizes of the sets, there is a tradeoff at play. On the one hand, more training data means a more accurately trained model. On the other, too little validation/test data gives us a noisier, less reliable estimate of the model’s performance. This means there is a need to find a balance. For large enough datasets, a 60-20-20 split or even an 80-10-10 split are widely used choices. For small datasets, the test set may not be representative enough, so practices such as k-fold cross-validation should be used instead.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)  # 25% of 80% is 20%
Code snippet for splitting data into training, validation, and test sets (source)
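For the small-data case mentioned above, here is a brief sketch of k-fold cross-validation with scikit-learn (the Iris dataset, the logistic regression model, and k=5 are arbitrary choices for the example): every sample ends up in a held-out fold exactly once, and the per-fold scores are averaged.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
# Each of the 5 folds serves once as the held-out set while the rest is used for training
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())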
Data Leakage

Just as with a water pipe, there are times when your data may leak from one set to another (source)
When splitting the data, it is important to verify that there is no overlap between the sets. The first measure to take is to remove duplicate entries from your data. In datasets where a single entity (e.g., a user) can have multiple data samples (e.g., individual purchases), the best practice is to keep all of these samples in only one of the sets. In other words, datasets should be split by entities and not just by samples. A less obvious type of leakage is when a feature contains information about the label that won’t be available at prediction time. For example, the index of a sample often carries information about the label because of the order in which the entries were added to the database; during inference, however, this feature won’t help with the prediction and may not even exist.
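As a rough sketch of entity-level splitting, scikit-learn’s GroupShuffleSplit keeps all samples that share a group on the same side of the split; the user IDs and toy data below are invented for the example.

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: two purchases each for three hypothetical users
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array(["user_a", "user_a", "user_b", "user_b", "user_c", "user_c"])

# Hold out one entire user as the test group
splitter = GroupShuffleSplit(n_splits=1, test_size=1, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
print("train users:", set(groups[train_idx]))  # two of the users
print("test users:", set(groups[test_idx]))    # the remaining user
# No user appears in both sets, so purchases by the same user cannot leak across the split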
If the data splitting is not done correctly, we risk getting a biased performance analysis that does not reflect the real-world results we need. Ensuring there is no leakage is an important part of the data validation process.
Splitting the Dataset for Time Series Data
Time series data poses a challenge because it does not divide into well-defined, standalone samples. For each point in time, data from the past can be used as features, while data from the future cannot. If we arbitrarily assign samples to the train and test sets, we may lose important contextual data for training and for making predictions.
Common practices involve using earlier data for training, and then evaluating on a test set that consists of the later samples. Sklearn’s TimeSeriesSplit provides functionality for such splitting of the data.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=1)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Output:
TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
TRAIN: [0 1 2 3] TEST: [4]
TRAIN: [0 1 2 3 4] TEST: [5]
Code snippet for splitting time series data (source)
In a nutshell, correctly applying the concepts and methodology of splitting the data into train, validation, and test sets is essential in getting an accurate estimate of your model’s performance on unseen data. The common pitfalls to look out for are data leakage, overfitting to the test set, and incorrect handling of datasets with more complex structures such as time series data.