Evaluating Model Performance Using Validation Dataset and Cross-validation Techniques


Machine Learning models may provide accurate predictions in training, yet underperform when applied in production. How do you guard against this? Read this article to learn more.

A major challenge in Machine Learning is maintaining the predictive power of a model when it sees new data. It is not uncommon for a model to perform noticeably worse on new data than it did during the initial training. This may happen for different reasons: model overfitting, unrepresentative data samples, or simply the stochastic nature of the algorithm, to name a few. Inadequate evaluation may put an underperforming model into production, or hold back a good one under the mistaken assumption that it has overfitted.

In this article, we discuss evaluating the performance of Machine Learning models in the context of validation techniques: what it is, how it works, and the primary caveats. We explain cross-validation, the most widely used performance evaluation method, as well as the danger of data leakage in cross-validation and ways to avoid it.

Performance Evaluation of a Machine Learning Model

Let’s start with the fundamentals. The objective of performance evaluation in Machine Learning is to understand how the model may perform on new data it has not yet seen.

The model training process creates a model that predicts accurately by understanding the data at hand. However, the model may overfit the data. This happens because, instead of understanding the underlying structure in the data, it “memorizes” facts specific to that dataset. The problem surfaces when the model is put into production with data it has not seen before.

Validation techniques exist for evaluating the performance of a model on different data splits to mitigate problems like this as early as possible. While there are several ways to do this, they share fundamental principles.

The Three-Way Holdout Method

One of the most fundamental validation methods for model evaluation is the Three-way Holdout Method. It has three stages, each with a corresponding dataset:

  • Training set: Used for deriving the Machine Learning algorithm to capture the relationships in the data.
  • Validation set: Used for an unbiased evaluation of the model fit during hyperparameter tuning, model selection, and error analysis.
  • Test set or Hold-out set: Used for the final, independent evaluation with data not seen by the algorithm during the training and validation processes.

These three phases and datasets are the building blocks of the Three-way Holdout Method. There are different terms for these stages and datasets depending on the context. However, the basic principles are the same regardless of the terminology.

The steps of the Three-way Holdout Method are:

  1. Split the data into training, validation, and test sets.
  2. Train the Machine Learning algorithm on the training set with different hyperparameter settings.
  3. Evaluate the model performance on the validation set and select the hyperparameters with the best performance on this validation set. This step is sometimes combined with the previous hyperparameter tuning step by fitting a model and calculating its performance on the validation dataset before moving to the next model.
  4. [Optional] Train a new model on the combined training and validation set, using the best hyperparameter values from the previous step.
  5. Test the model on the independent hold-out set.
  6. Retrain the model on all the data and use the resulting model in production.
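
The first splitting step can be sketched with scikit-learn's `train_test_split`. This is a minimal, illustrative example; the 60/20/20 proportions, the Iris dataset, and the random seeds are arbitrary choices, not a prescription:

```python
# Three-way holdout: carve out the test set first, then split the remainder
# into training and validation sets.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as the independent test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Split the remaining 80% into training (60%) and validation (20%) sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```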

Guidelines for the Three-Way Holdout Method

The Three-way Holdout Method is at the heart of most validation techniques for model performance evaluation. Here are some guidelines to keep in mind:

  • Do not use the training error for evaluation. The training error is the prediction error on the same data used to train the model. It can be severely misleading. As an extreme example, a 1-Nearest Neighbor model achieves 0% training error, because every training point is its own nearest neighbor. This result says nothing about how the model performs on a dataset it has not yet seen.
  • Avoid overlap between sets. It is important to remove duplicates and partial duplicates from the data before splitting and to ensure identical or related samples do not belong simultaneously to different sets. Even samples that are not entirely identical but still closely related carry information about each other, so they should not belong to the training and test sets at the same time (e.g., pictures taken in the same situation in an object detection context).
  • Use the test set only for the final evaluation. Hold off the assessment with the test set until the training phase is complete. Even if it is only to make decisions about the evaluation pipeline, it can lead to information leakage. Tweaking a model based on how it performs on the test set compromises the testing step in evaluating the model on new data not yet seen by the model.
  • Look out for sampling bias. The selection of the training, testing, and validation sets needs to be free of bias that makes specific groups more or less likely to be in one set than others. For example, if the data is sorted by department, the training, validation, and test sets may contain different distributions of department-specific samples. Without shuffling the data first, the model may learn behaviors that are less representative; as a result, the model underperforms.
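
The sampling-bias point can be made concrete with a small sketch. `train_test_split` shuffles by default; passing the labels to `stratify` additionally preserves the class proportions in each split. The synthetic imbalanced labels below are purely illustrative:

```python
# Guarding against sampling bias: shuffle and stratify the split.
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 90 + [1] * 10)   # imbalanced labels, sorted by class
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Both splits keep the original 90/10 class ratio.
print(np.bincount(y_tr), np.bincount(y_te))
```

Without `stratify`, a sorted dataset split without shuffling could put all minority-class samples into a single set.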

Data Scarcity in Machine Learning Performance Evaluation

Often there is only a limited number of samples, and generating labeled data may be expensive, slow, or even undesirable (e.g., when predicting customer churn or system failure). Splitting the data into training, test, and validation sets involves effectively distributing the available—clean and labeled—data resources among them.

As a rule of thumb, all three stages benefit from more data. Consequently, they compete for the same data resources. Using too much data for testing may be a waste of information from the training point of view; on the other hand, keeping the test dataset too thin makes the final evaluation less effective.

Using advanced techniques such as cross-validation to make the most out of the available data can help address this problem in the model performance evaluation process.


Cross-Validation in Machine Learning

Cross-validation is a technique for model performance evaluation in Machine Learning. It uses different subsamples of the data to train and evaluate the model by running multiple iterations. Each iteration splits the data into different training and validation folds (or subsamples) and repeats model training and validation on them.

With a single hold-out set, it is difficult to know the degree to which the training and test sets reflect each other and thus the stability of the model over all the data. Cross-validation can provide more accurate feedback about the performance of a model by improving on the three-way holdout method.

The multiple training-validation rounds allow a wider inclusion of data samples, calculating the average and standard error for each fold. This provides a more comprehensive view of the robustness and range of the model performance.

However, models built with cross-validation for hyperparameter tuning or model selection still require a true hold-out sample for the final evaluation to avoid overfitting. Since cross-validation uses all data available for training, it does not have a feedback mechanism for checking the model performance on independent data. Models and hyperparameters chosen with cross-validation are simply the best versions trained on the available data; therefore, a hold-out test sample or a nested approach is still necessary for a final evaluation.
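
The pattern above can be sketched with scikit-learn: cross-validate on the non-test data to get a mean score and its spread, and keep a hold-out set for the final check. The dataset and model here are illustrative choices:

```python
# Cross-validation for evaluation, plus a reserved hold-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=5000)

# 5-fold cross-validation on the non-test data: mean accuracy and spread.
scores = cross_val_score(model, X_rest, y_rest, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Only after model selection is done: one final check on the hold-out set.
final_score = model.fit(X_rest, y_rest).score(X_test, y_test)
```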

Cross-validation techniques fall into exhaustive and non-exhaustive approaches.

Exhaustive Cross-Validation

Exhaustive cross-validation techniques split the dataset into training and validation sets in every way possible.

  • Leave-P-Out Cross-Validation (LPOCV): This leaves out p samples for validation and uses the rest for training. It can become computationally costly, since it requires one round for every possible validation set of size p (C(n, p) rounds for n samples), and it produces overlapping validation sets.
  • Leave-One-Out Cross-Validation (LOOCV): This is a special case of LPOCV that leaves out a single sample for each iteration. It provides a comprehensive evaluation since it uses all samples for validation. LOOCV is less costly than LPOCV (n rounds); however, it still requires more computational resources than non-exhaustive approaches such as k-fold cross-validation.
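
A minimal LOOCV sketch using scikit-learn's `LeaveOneOut` splitter (the Iris dataset and the 3-NN classifier are illustrative choices):

```python
# LOOCV: n training rounds, each holding out a single sample for validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()
print(loo.get_n_splits(X))  # 150 rounds, one per sample

scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=loo)
# Each round scores a single sample (0 or 1); the mean is the LOOCV accuracy.
print(scores.mean())
```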

Leave-One-Out Cross-Validation (LOOCV) (Source)

Non-Exhaustive Cross-Validation

Non-exhaustive approaches approximate model performance by using only some of the possible sampling iterations.

  • Repeated Hold-Out Cross-Validation (repeated random sub-sampling or Monte Carlo cross-validation). This repeats random hold-out sampling many times. A potential downside of this method is that it may use certain samples multiple times and leave others out entirely since no sample is guaranteed to be in one or the other.
  • K-fold cross-validation. This divides the entire dataset into k subsamples. It runs k iterations, and in each iteration, it validates the model on a different subsample and trains the model on the rest.

K-Fold cross-validation is the most widely used approach and is the focus for the remainder of this article.
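
The difference between the two approaches can be verified directly. In this sketch (using a toy 20-sample array), k-fold puts every sample in a validation fold exactly once, while repeated hold-out (`ShuffleSplit`) offers no such guarantee:

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(20).reshape(-1, 1)

# K-fold: every sample lands in a validation fold exactly once.
kfold_val_counts = np.zeros(20, dtype=int)
for _, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    kfold_val_counts[val_idx] += 1
print(kfold_val_counts)  # all ones

# Repeated hold-out (Monte Carlo): counts typically vary per sample,
# and some samples may never appear in a validation set.
mc_val_counts = np.zeros(20, dtype=int)
for _, val_idx in ShuffleSplit(n_splits=5, test_size=0.2,
                               random_state=0).split(X):
    mc_val_counts[val_idx] += 1
print(mc_val_counts)
```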

2-Fold Cross-Validation and Repeated Hold-out Cross-Validation Methods (Source)

K-Fold Cross-Validation

In K-Fold cross-validation, the entire dataset is divided into k subsamples to repeat the model training and validation process k times. Each subsample is a validation set in one round and a part of the training set in the rest of the rounds.

The advantages of the K-Fold cross-validation compared to other cross-validation methods are:

  • It uses every sample in the data for validation exactly once.
  • The computational cost is relatively low (k rounds). The choice of k changes the validation robustness and runtime.
  • It avoids overlap between the training and validation sets.

5-Fold Cross-Validation (Source)

There are ways to enhance cross-validation techniques. Here are some examples:

  • Random Permutation (Shuffling). This shuffles the data samples and generates splits from the shuffled data, instead of defining subsamples based on the default order of the dataset.
  • Repeated K-Folds. Depending on a number of factors, a single round of k-fold cross-validation may not assess the model adequately. Performing multiple rounds of k-fold validation can help address this, with redefined splits in each round.
  • Stratification. If the class labels are unbalanced, cross-validation may produce subsamples with widely varying class distributions or even without a member of the minority class, especially if it is a “rare event.” Creating stratified subsamples preserves the class frequencies within the subsamples to mitigate the problem.
  • Nesting. If model optimization is part of the model selection process, one can cross-validate the optimization step as part of the training stage of the original cross-validation (see the Model Optimization section below for details).
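
Two of these enhancements, stratification and repeated k-folds, can be sketched with scikit-learn. The imbalanced synthetic labels and the fold/repeat counts below are illustrative:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, StratifiedKFold

y = np.array([0] * 90 + [1] * 10)   # 10% minority class
X = np.arange(100).reshape(-1, 1)

# Stratification: every validation fold keeps the 90/10 class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = [np.bincount(y[val_idx]) for _, val_idx in skf.split(X, y)]
print(fold_counts[0])               # 18 majority, 2 minority samples

# Repeated k-fold: 3 repetitions of 5 folds = 15 train/validation rounds,
# with the splits redefined in each repetition.
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
print(rkf.get_n_splits(X, y))       # 15
```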

Depending on the specific Machine Learning problem, domain-specific techniques may be necessary, as standard cross-validation may not be the most appropriate. Examples include time series data, where information from the future must not leak into the training folds, and grouped data, where related samples must stay within the same split.

Data Leakage in Cross-Validation

Data leakage occurs during the training process when a model uses information not available in the production environment. The model performs better with it than without, and the extra information gives the illusion of better model performance.

Cross-validation does not prevent all forms of data leakage. The model may pass both the cross-validation and the testing phases with flying colors; however, it underperforms in production. Obviously, this is very problematic.

Two examples of data leakage during cross-validation are leakage due to normalization and leakage due to model optimization.


Normalization

Normalizing data before cross-validation may leak information about the test set’s distribution into the training dataset. This weakens the role of the test set as the “new, previously unseen” data for final model evaluation.

Consider the following. There is an outlier in the test set. Normalizing the data before splitting transforms the training samples based on the distribution metadata of the entire dataset, including the outlier values in the test set. A model trained on such normalized data contains some information about the test set. As such, the test set is no longer a new, never-before-seen dataset, making any evaluation using it less efficient.

The correct way to handle this is to split the data first, then use the distribution statistics of the training set to normalize both the training and test sets. This keeps the test set untouched and suitable for a completely independent evaluation.
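
In scikit-learn, the leakage-safe pattern is to put the scaler inside a `Pipeline`, so that in each cross-validation round it is fit on the training folds only. The dataset and model below are illustrative choices:

```python
# Avoiding normalization leakage: scale inside the cross-validated pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Leaky alternative: StandardScaler().fit_transform(X) before splitting
# would use the validation folds' statistics. Instead, the pipeline refits
# the scaler on the training folds in every round.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```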

Model Optimization

Use nested cross-validation to avoid data leakage during model optimization. Nested cross-validation optimizes the model in a separate inner cross-validation step.

Model optimization covers tasks that improve model performance beyond training itself, such as hyperparameter tuning and feature selection. These steps involve assessing the performance of candidate models, much like model training does. When done as part of ordinary cross-validation, model optimization uses the validation set of that round but has no separate set for evaluating the optimized model. It indicates how well the model was optimized, but not how well the optimized model generalizes.

Nested cross-validation “nests” cross-validation steps into the training stage of the original cross-validation in two loops:

  • Inner cross-validation loop. Evaluates the model optimization methods and returns the best-performing configuration.
  • Outer cross-validation loop. Evaluates the model configurations chosen in the inner loop and calculates their aggregate performance.

The outer cross-validation loops can receive different model configurations from the inner loop, making this method somewhat harder to interpret than simple cross-validation.

It helps to think of nested cross-validation as a method to evaluate not only the models themselves but also the entire modeling pipelines or model training functions. In that sense, the outer loop evaluates not single models but the model-building process, which includes the selection and optimization of models from the inner cross-validation.
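
A minimal nested cross-validation sketch: `GridSearchCV` plays the inner loop (hyperparameter tuning) and `cross_val_score` the outer loop (evaluation of the whole tuning process). The dataset, model, and parameter grid are illustrative choices:

```python
# Nested cross-validation: tune in the inner loop, evaluate in the outer.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: 3-fold search over an illustrative grid of C values.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer loop: 5 rounds, each tuning a fresh model on its training folds.
outer_scores = cross_val_score(inner, X, y, cv=5)

# The mean estimates how well the model-building process generalizes,
# not the performance of any single final model.
print(outer_scores.mean(), outer_scores.std())
```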

As is the case with simple cross-validation, the nested cross-validation does not identify a final model but evaluates whether the model-building process generalizes well on data not yet seen. When it returns acceptable results, the steps to select the final model are:

  1. Rerun the inner cross-validation on the entire dataset.
  2. Configure the model with the hyperparameters found during the previous step.
  3. Fit the model on the entire dataset.

Nested Cross-Validation (Source)
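
The three final-model steps above can be sketched in scikit-learn, where `GridSearchCV` with `refit=True` (the default) reruns the search on all the data and refits the best configuration on the entire dataset in one call. The model and grid are illustrative:

```python
# Selecting the final model after nested CV returns acceptable results.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3,
                      refit=True)
search.fit(X, y)                      # steps 1-3 in one call
final_model = search.best_estimator_  # fitted on the entire dataset
print(search.best_params_)
```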

The Ongoing Validation of Machine Learning Models

Proper evaluation of model performance in Machine Learning is critical. The discussion above has covered validations and folds, the main mechanism and data components of a model performance evaluation, cross-validation for robustness even with limited data resources, and the problem of data leakage in cross-validation.

However, model validation in Machine Learning is too vast a topic to mention everything in a single article. There are statistical tests for comparing models, techniques such as early stopping and bootstrapping, and topics on validating small datasets, just to name a few. The amount of detail to keep in mind while developing a Machine Learning product is staggering. For that reason, it is important to leverage a proven validation system against common as well as not-so-common issues to ensure continued trust and confidence in the models.

Machine Learning models need to continuously perform well even when the data, the development tools, and/or the business environment changes. To mitigate problems, it is important to:

  • Monitor model performance and data drift in production.
  • Revalidate and retrain models when the data or the environment changes.
  • Automate testing of data and models throughout the deployment pipeline.

These tasks require a lot of time and expertise. Leveraging continuous validation solutions like Deepchecks can be a lifesaver.

Deepchecks works with a wide range of model types (e.g., fraud, LTV, and NLP models) and implements the necessary components (e.g., dataset validation) in the recommended validation pipeline. It generates alerts when the model starts to overfit and provides tools to monitor the Machine Learning lifecycle. Would you like to hear more details? Let us know!

