Perhaps one of the first concepts newcomers learn in the field of machine learning is the division of data into training, validation, and test sets. While every ML practitioner is familiar with these concepts, some common misconceptions about them can end up hurting the way you validate and test your machine learning models. Here is how these sets are generally defined:
“– Training set: A set of examples used for learning, that is to fit the parameters of the classifier.
– Validation set: A set of examples used to tune the parameters of a classifier, for example to choose the number of hidden units in a neural network.
– Test set: A set of examples used only to assess the performance of a fully-specified classifier.”
- Brian Ripley, page 354, Pattern Recognition and Neural Networks, 1996
The motivation for splitting the data into different sets is to avoid memorization and overfitting. Let's say we want to test whether a primary-school student understands the Fibonacci sequence. If we reveal that the sequence starts with 1, 1, 2, 3, 5, 8, 13 and then ask for the 5th number, we are not measuring the student's understanding of the progression but rather their memory. If, however, the student can apply the rule and tell us the numbers that follow, they will have proved their understanding. Similarly, a model can over-memorize (overfit) the training data, so its training performance seems to improve while it is in fact deteriorating: it generalizes worse to new examples. Using an unseen test set ensures that the measured success is not due to overfitting or memorization, and gives us a glimpse of the expected performance on the actual data the model will see in production. Estimating and detecting such overfitting is an essential aspect of model validation in machine learning.
What’s the Validation Set For?
We have seen the importance of keeping a held-out test set in order to get a pure measurement of our model's performance. But how do we choose the hyperparameters of our model, and how do we compare different models' performance? Which dataset should be used? As we've seen, evaluating on the training set is not informative enough. On the other hand, if we prefer one model over another based on its test-set performance, we risk contaminating the test set and overfitting to it. These processes should therefore be done using the validation set. Error analysis should be done on the validation set as well, for the same reason. Many practitioners do not understand this properly, and the result is a significant risk of producing a biased estimate of the model's performance on real-world data.
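As a concrete sketch of this workflow (using a synthetic dataset and a hypothetical set of candidate values for logistic regression's regularization strength `C`), each candidate is fit on the training set and compared on the validation set, and the test set is touched only once at the very end:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

best_score, best_C = -1.0, None
for C in [0.01, 0.1, 1.0, 10.0]:  # hypothetical candidate hyperparameter values
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))  # compare on validation only
    if score > best_score:
        best_score, best_C = score, C

# The test set is used exactly once, after the choice of hyperparameters is final
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_score = accuracy_score(y_test, final_model.predict(X_test))
```

Had we picked `C` by test-set accuracy instead, `test_score` would no longer be an unbiased estimate of real-world performance.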
Determining the Sizes of the Sets
When it comes to determining the sizes of the sets, there is a tradeoff at play. On the one hand, more training data means a more accurate trained model. On the other hand, a small amount of test/validation data will give us a noisier, less reliable estimate of the model's performance. Thus, there is a need to strike the right balance. For large enough datasets, a 60-20-20 split or even an 80-10-10 split are widely used choices. For small datasets, a single test set may not be representative enough, and practices such as k-fold cross-validation should be used instead.
```python
from sklearn.model_selection import train_test_split

# First carve out the test set, then split the remainder into train/validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)  # 25% of 80% is 20%
```
Code snippet for splitting data to training, validation and test sets (source)
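For the small-dataset case mentioned above, a minimal k-fold cross-validation sketch (again on a synthetic dataset) could look like this: every sample serves as validation data exactly once, so the performance estimate does not hinge on one lucky or unlucky split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A small synthetic dataset, where a single held-out set would give a noisy estimate
X, y = make_classification(n_samples=100, random_state=0)

# 5-fold cross-validation: train on 4 folds, validate on the 5th, rotate
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
mean_score, score_std = scores.mean(), scores.std()
```

Reporting both the mean and the spread of the fold scores gives a sense of how stable the estimate is.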
Just as with a water pipe, there are times when your data may leak from one set to another (source)
When splitting the data, it is important to verify that there is no overlap between the sets. A first measure to take is to remove duplicate entries in your data. Furthermore, in datasets where a single object (for example, a user) can have multiple data samples (for example, individual purchases), a best practice is to have all these samples in only one of the datasets. In other words, datasets should be split according to entities and not just by samples. Another less trivial type of leakage is when a feature contains more information than it should. For example, the index of a sample often contains information about the label because of the order in which the entries were added to the database. During inference, on the other hand, this feature won’t help with the prediction and it may not even exist.
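A sketch of such entity-level splitting, using sklearn's GroupShuffleSplit on hypothetical purchase records (the `user_ids` array below is invented for illustration): all samples belonging to the same user land on the same side of the split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical purchase records: each sample belongs to one user (the entity)
X = np.arange(10).reshape(-1, 1)                     # feature placeholder
user_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])  # two purchases per user

# Split by group, so a user never appears in both train and test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=user_ids))

# Sanity check: the sets of users are disjoint
assert set(user_ids[train_idx]).isdisjoint(user_ids[test_idx])
```

A plain `train_test_split` on the same data could easily place one purchase of a user in train and the other in test, leaking user-specific patterns across the boundary.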
If the data splitting is not done correctly, we risk getting a biased performance analysis that does not reflect the real-world results we will achieve. Ensuring that there is no leakage is an important component of the data validation process.
Splitting the Dataset for Time Series Data
Time series data poses a problem since the data is not divided into well-defined standalone samples. For each point in time, the data from the history can be used as features, while the data from the future cannot be used. If we arbitrarily separate samples into train and test, we may lose important contextual data for training and for making predictions.
Thus, common practices involve using earlier data for training, and then evaluating on a test set that consists of the later samples. Sklearn’s TimeSeriesSplit provides functionality for such splitting of the data.
```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=1)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```

Output:

```
TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
TRAIN: [0 1 2 3] TEST: [4]
TRAIN: [0 1 2 3 4] TEST: [5]
```
Code snippet for splitting time series data (source)
To sum it all up, correctly applying the concepts and methodology of splitting the data into train, validation and test sets is essential in order to get an accurate estimate of your model’s performance on unseen data. Common pitfalls involve data leakage, overfitting to the test set and incorrect handling of datasets with more complex structures such as time series data.