
Top Techniques for Cross-validation in Machine Learning

This blog post was written by Inderjit Singh as part of the Deepchecks Community Blog. If you would like to contribute your own blog post, feel free to reach out to us via blog@deepchecks.com. We typically pay a symbolic fee for content that’s accepted by our reviewers.


A good machine learning model needs many properties, but one of the most fundamental is the ability to generalize, i.e. to perform well on data points it has not seen during training. This ability is critical because it decides whether a model is useful in the real world and will stand the test of production deployment. The weights and biases arrived at after training therefore need to be tested thoroughly to ensure that we do, in fact, have a proper fit. A good way to ensure that is to cross-validate the model. Good cross-validation ensures that the results obtained during the training and development phase are reproducible and consistent, unless, of course, a shift in the incoming data distribution is observed. A good cycle of the machine learning validation process will yield the best hyperparameters and will look something like this:

Source: Deepchecks
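
As a minimal sketch of that cycle (assuming scikit-learn and a hypothetical grid of regularization strengths, since the figure does not prescribe a specific tool), a grid search can run the cross-validation loop for us and report the best hyperparameters:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

iris = load_iris()

# hold out a final test set; the grid search cross-validates inside the training set
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

param_grid = {"C": [0.01, 0.1, 1, 10]}  # illustrative grid, not prescribed by the figure
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(x_train, y_train)

print("Best hyperparameters: {}".format(search.best_params_))
print("Best cross-validation score: {}".format(search.best_score_))
print("Held-out test score: {}".format(search.score(x_test, y_test)))

With the default refit=True, GridSearchCV refits the model on the full training set using the best parameters, so the held-out test set is only touched once at the very end.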

What is cross-validation?

According to Wikipedia, cross-validation (also referred to as rotation estimation or out-of-sample testing in statistics) refers to a family of model validation techniques that assess how well the results of a statistical analysis will generalize to an independent dataset or a holdout dataset. A model that is overfit is of limited value in the real world, yet such models can sometimes yield good results on the validation dataset. This scenario is especially likely when the training and testing datasets are small. Under such circumstances it is therefore critical to perform cross-validation on the training set, i.e. to implement cross-validation across the entire dataset.

How to perform cross-validation with various techniques?

There are various ways to perform cross-validation, and the model, the availability of data, and the kind of problem we are working on decide which technique works best for us. A few of the most important techniques are as follows:

1. Holdout Cross-Validation

This is the most commonly used validation technique. We split the dataset into two unequal parts: most of the data points are used for training, and the rest are used to validate the model, i.e. to make sure that the reduction in the objective function corresponds to better predictions and is not a case of overfitting. A simple example is splitting the dataset in a 70:30 ratio, where 70% of the data is used for training and the remaining 30% for validating the model. Various iterations of this split are highlighted in the figure below:

Image Source: blog.jcharistech.com

A small code snippet to implement this is as follows:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
Y = iris.target

log_reg = LogisticRegression(max_iter=1000)

# the actual splitting happens here: 70% for training, 30% for validation
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

log_reg.fit(x_train, y_train)
predictions = log_reg.predict(x_test)

print("Accuracy score on training set is {}".format(accuracy_score(y_train, log_reg.predict(x_train))))
print("Accuracy score on test set is {}".format(accuracy_score(y_test, predictions)))

2. Stratified K-Fold CV (Cross-Validation)

Stratification is used when the dataset contains imbalanced classes. If we cross-validate with a plain splitting technique, the subsamples may have varying class distributions, and some imbalanced folds can produce exceptionally high scores, inflating the overall cross-validation score, which is undesirable. We therefore create stratified subsamples that preserve the class frequencies in each individual fold, ensuring that we get a clear picture of the model's performance. The following visualization clarifies this:


Source: stats.stackexchange.com

 

A small code snippet for this technique is as follows:

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data
Y = iris.target

log_reg = LogisticRegression(max_iter=1000)

# each of the 5 folds keeps the same class proportions as the full dataset
stratified_cross_validate = StratifiedKFold(n_splits=5)
score = cross_val_score(log_reg, X, Y, cv=stratified_cross_validate)

print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score: {}".format(score.mean()))

3. Leave P Out CV (Cross-Validation)

Leave-P-Out is one of the exhaustive cross-validation techniques, in which the entire dataset takes part in the training and validation cycles. For example, if we have 1,000 data points in our dataset and set p to 100, then in each cycle 100 points are used as the validation set and the remaining 900 points are used for training; this is repeated for every possible choice of the 100 held-out points. The following image shows Leave-P-Out cross-validation visually:


Source: Researchgate

A code snippet to implement this technique is as follows:

from sklearn.model_selection import LeavePOut, cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X = iris.data
Y = iris.target

# leaving 2 samples out of 150 gives 11175 splits, so this takes a while to run
leave_p_out = LeavePOut(p=2)
print("Number of splits: {}".format(leave_p_out.get_n_splits(X)))

random_forest_classifier = RandomForestClassifier(n_estimators=10, max_depth=5, n_jobs=-1)
score = cross_val_score(random_forest_classifier, X, Y, cv=leave_p_out)

print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score: {}".format(score.mean()))

4. Monte Carlo Cross-Validation / Shuffle Split

This is a flexible cross-validation strategy: the data points are split into training and validation partitions at random. We still set the percentages of the training and validation sets, but the partitions themselves are created randomly for each of the requested splits, so a given sample may appear in several validation sets or in none at all.

A code snippet for Monte Carlo cross-validation is as follows:

from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
log_reg = LogisticRegression(max_iter=1000)

# 10 random splits: 50% of the data for training and 30% for validation in each split
shuffle_split = ShuffleSplit(test_size=0.3, train_size=0.5, n_splits=10)
cross_val_scores = cross_val_score(log_reg, iris.data, iris.target, cv=shuffle_split)

print("Cross Validation Scores:\n{}".format(cross_val_scores))
print("Average Cross Validation score: {}".format(cross_val_scores.mean()))

5. Time Series CV (Cross-Validation)

Regular cross-validation techniques are not useful when working with time series datasets. Such datasets cannot be randomly split into training and validation sets, because we would lose important structure such as seasonality, and since the order of the observations matters, the data cannot be split at arbitrary intervals. To tackle this issue we can use time series cross-validation.
In this type of cross-validation we take a small subsample of the data (keeping the order intact) and try to predict the immediately following examples for validation. This is also referred to as "forward chaining" or sometimes "rolling cross-validation". Because we continuously train and validate the model on successive snippets of the data, we can be reasonably confident we have found a good model if it gives good results on these rolling samples. The following image highlights how this is implemented on a data sample:

Source: Towards Data Science

 

A small code snippet to implement this cross-validation is as follows:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# a toy ordered dataset: 7 observations with 2 features each
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4], [77, 33]])
y = np.array([1, 2, 3, 4, 5, 6, 7])

rolling_time_series = TimeSeriesSplit()
print(rolling_time_series)

# each split trains on an expanding window of past samples and tests on the ones that follow
for current_training_samples, current_testing_samples in rolling_time_series.split(X):
    print("TRAIN:", current_training_samples, "TEST:", current_testing_samples)
    X_train, X_test = X[current_training_samples], X[current_testing_samples]
    y_train, y_test = y[current_training_samples], y[current_testing_samples]
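
For an actual forecasting task you would typically plug the same splitter into cross_val_score; the sketch below does this on a synthetic series with a simple linear model (an illustrative choice, not from the original post). TimeSeriesSplit also accepts a max_train_size argument if a sliding window is preferred over an expanding one:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# synthetic "time series": predict the next value from the current one
rng = np.random.RandomState(0)
series = np.cumsum(rng.randn(31))
X = series[:-1].reshape(-1, 1)   # value at time t
y = series[1:]                   # value at time t+1

rolling = TimeSeriesSplit(n_splits=5)  # max_train_size=... would give a sliding window
scores = cross_val_score(LinearRegression(), X, y, cv=rolling)
print("Forward-chaining R^2 scores: {}".format(scores))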

 

6. K-Fold Cross-Validation

This is one of the most widely used cross-validation techniques. The main idea is to partition the data into k "folds" (usually of equal size); in each iteration one fold is held out for validation and the remaining folds are combined to form the training sample, so that every fold is used for validation exactly once. As the name suggests, the training cycle is repeated k times, and the final score is computed as the mean of the individual validation runs. The following image demonstrates k-fold cross-validation for a given sample:

Source: Towards Data Science

 

A sample code snippet to perform k-fold cross-validation is as follows:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

iris = load_iris()
features = iris.data
outcomes = iris.target

log_reg = LogisticRegression(max_iter=1000)

# shuffle before splitting, since the iris targets are ordered by class
k_fold_validation = KFold(n_splits=5, shuffle=True, random_state=42)
score = cross_val_score(log_reg, features, outcomes, cv=k_fold_validation)

print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score: {}".format(score.mean()))

Conclusion

We have covered some of the most prominent techniques for cross-validating machine learning models; there are certainly other methods, and an exhaustive list can be found here. The choice of a particular type of CV will largely depend on the specific implementation being carried out, the availability of data points, compute, time, and so on. It may also be possible to combine multiple techniques into a single pipeline so that we can be certain the results obtained are reproducible and free from bias. I sincerely hope this helps people create better models with less bias, which will solve real-world problems and have a positive impact on society.
