
How to do cross-validation?

Randall Hendricks
Answered

Implementing cross-validation in machine learning is key to ensuring that bias, overfitting, and chance effects don't get baked into the model, so that we are left with the best possible combination of parameters. There are several cross-validation methods available:

1. Holdout Cross-Validation

This is the simplest and most commonly used form of cross-validation. We divide the data into two separate parts: a training set and a test set. The usual proportion is around a 70:30 split between training and testing data, though this varies with how much data is available for a given problem. We train the model on the training data and validate its predictions on the held-out test set. Scikit-learn has built-in support for this technique.
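A minimal sketch of the holdout split using scikit-learn's `train_test_split`; the Iris dataset is used here purely as an illustrative example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load a small example dataset (150 samples, 4 features).
X, y = load_iris(return_X_y=True)

# 70:30 train/test split; stratify keeps class proportions balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

The `random_state` argument makes the split reproducible, which matters when you want to compare models trained on the same holdout partition.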

2. K-Fold Cross-Validation

Perhaps the second most frequently used technique, k-fold cross-validation minimizes the disadvantages of holdout validation by splitting the dataset into multiple small chunks (folds), each of which serves in turn as the test set while the rest serve as training data. Because every part of the dataset is used for both training and testing, we can gauge how much the optimization metric (accuracy, F1-score, precision, recall, etc.) varies across subsets. A well-fit model should not show significant variation in its metric between folds; when we do detect such discrepancies, they are almost always attributable to an overfit model or a biased dataset.
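The fold-by-fold variance check described above can be sketched with scikit-learn's `cross_val_score`; the choice of `LogisticRegression` and the Iris data here are illustrative assumptions, not part of the original answer:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold serves once as the test set,
# so we get five independent scores for the same model configuration.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# A large spread between folds suggests overfitting or a biased split.
print(scores.mean(), scores.std())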

3. Leave-One-Out Cross-Validation

This method can be considered an edge case of k-fold cross-validation in which k equals the number of samples. In each iteration, a single sample serves as the test set while the model is trained on all the remaining samples; a separate model is trained in every iteration. Once validation on the held-out sample is complete, we proceed to the next iteration, repeating until every sample has been held out once, and average the per-iteration scores to get the final cross-validation result.
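A sketch of the procedure above using scikit-learn's `LeaveOneOut` splitter; as before, the dataset and classifier are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)  # 150 samples -> 150 trained models

# Each iteration trains on n-1 samples and tests on the one left out,
# so each individual score is either 0.0 (miss) or 1.0 (hit).
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())

# The final score is the average over all n single-sample evaluations.
print(scores.mean())
```

Because it trains one model per sample, leave-one-out is usually only practical on small datasets; for larger ones, k-fold with a modest k is the standard compromise.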

