
How to determine the best model for prediction with cross-validation?

Kayley Marshall

When we speak of models, we mean a particular way of describing how input data relates to what we are trying to predict. In general, we don't treat specific trained instances of that approach as separate models. So while you might say, "I have a linear regression model," you wouldn't call two different sets of learned coefficients two independent models, at least not in the framework of model selection.

When you use model validation techniques, you are checking how effectively your model can be trained on some data and then predict data it hasn't seen before. We use cross-validation for this because training on all of the data leaves none for testing. You could start with a single split, say 70% of the data for training and 30% for testing. But what happens if the 30% you hold out for testing happens to contain many points that are exceptionally easy (or exceptionally hard) to predict? Then we won't have gotten an accurate evaluation of the model's capacity to learn and predict.
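For concreteness, here is a minimal sketch of such a single hold-out split, assuming scikit-learn; the synthetic dataset and the 70/30 ratio are purely illustrative:

```python
# A minimal sketch of a single 70/30 hold-out split (illustrative data).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=0.5, random_state=0)

# Hold out 30% of the data for testing; train on the remaining 70%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
print(f"Hold-out R^2: {model.score(X_test, y_test):.3f}")
```

The weakness is that this single score depends entirely on which points landed in the held-out 30%.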

We would rather make use of all the data. Extending the splitting idea, a 5-fold cross-validation trains the model 5 times, each time on 80% of the data, and tests it on the remaining 20%, making certain that each data point appears exactly once in a test fold. As a result, we've used every data point available to judge how effectively our model fulfills the goal of learning from some data and predicting new data.
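A minimal sketch of that 5-fold procedure, again assuming scikit-learn with illustrative data:

```python
# A minimal sketch of 5-fold cross-validation (illustrative data).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=1000, n_features=10, noise=0.5, random_state=0)

# 5 folds: each fold serves once as the 20% test set
# while the other 80% is used for training.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print(f"Per-fold R^2: {scores}")
print(f"Mean R^2: {scores.mean():.3f} (+/- {scores.std():.3f})")
```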

However, the goal of cross-validation is not to arrive at our final model. We don't make real predictions with these 5 trained instances of the model. To build the best model possible, we want to use all the data we have. Cross-validation is used to assess modeling approaches, not to produce the final model.

Let's imagine we have two candidate models, a linear regression and a neural network. How can we tell which one is superior? We can use K-fold cross-validation to check which is more accurate at predicting the test-fold points. But after choosing the better-performing approach via cross-validation, we train that model, whether it is the linear regression or the neural network, on all the data. The actual model instances we trained during cross-validation are not used for our final predictions.
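Here is a minimal sketch of that complete workflow, assuming scikit-learn; MLPRegressor stands in for "a neural network", and the dataset and hyperparameters are illustrative:

```python
# A minimal sketch of model selection via cross-validation,
# then refitting the winner on ALL the data (illustrative setup).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=1000, n_features=10, noise=0.5, random_state=0)

candidates = {
    "linear regression": LinearRegression(),
    "neural network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    ),
}

# Step 1: use cross-validation only to compare the two approaches.
mean_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}
print(f"Cross-validation scores: {mean_scores}")

# Step 2: discard the per-fold instances and train the winning
# approach from scratch on all the data to get the final model.
best_name = max(mean_scores, key=mean_scores.get)
final_model = candidates[best_name].fit(X, y)
```

Note that the final model is fit fresh on the full dataset; the five per-fold fits only served to pick between the two approaches.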
