Introduction
In machine learning (ML), no single universal model can cater to every dataset or business problem. Every model has its capabilities, strengths, and weaknesses. Their applicability varies based on the dataset format, quality, and the problem they are trying to solve. This brings us to the concept of ML model selection, a critical step in the ML development lifecycle.
Model selection is a machine learning process used to choose the best model for a given task from a collection of candidate models. The candidate models are assessed using different model selection techniques. The outcome of model selection relies on a robust validation strategy and appropriate evaluation metrics (discussed below) that can quantitatively verify the quality of the model.
Let’s explore ML model selection in detail and list some prominent model selection and validation techniques.
Why is Model Selection in Machine Learning Important?
ML models often work well in controlled academic settings but fail in production environments, especially at industrial scale. That’s because the real-world environment involves a number of factors (discussed in the next section) that can limit an ML model’s performance. Hence, rigorous model selection is needed; it can significantly impact the model’s performance and accuracy on real business problems.
Besides finding the best-suited model for a particular task, model selection is important for several other reasons, such as:
- Ensures that the model’s predictive performance is generalizable to unseen real-world data.
- Helps avoid overfitting or underfitting, which can occur when a model is poorly suited to the data; model selection thus favors a model with an appropriate bias-variance trade-off.
- Helps balance the model’s performance and the cost of computational resources used to build the model.
5 Important Factors to Consider While Selecting an Appropriate Machine Learning Model
While the best-performing ML model is required for any task, performance is not necessarily the only factor to consider when selecting a model. Some other prominent factors include:
1. Dataset Size & Format
Different ML models are designed to handle specific data types. For instance, artificial neural networks excel in processing vast amounts of numerical data, while transformer models work well for Natural Language Processing (NLP) tasks.
Also, large-scale datasets can often only be handled well by advanced models based on deep neural networks; simpler tree-based or gradient-boosted models generally can’t capture all the information in such datasets. Hence, dataset size and format add significantly to the complexity of the model selection process.
2. Training Time, Inference Time & Associated Costs
ML models, especially enterprise-grade ones, can take days or months to train. For instance, researchers estimate that training a GPT-3 model with 175 billion parameters takes around 34 days. Their experiments also show that a 530-billion-parameter model can take up to 140 days of training time.
Such models need thousands of costly high-end GPUs to manage the required processing power. For instance, OpenAI, which has developed the revolutionary GPT-4 language model, redesigned its entire deep learning stack in the last two years to build a supercomputer based on Azure cloud. Moreover, inference or prediction time on real-world data also requires significant infrastructure resources.
That is why large models are usually maintained by bigger AI labs or organizations that have the required capital to fund such ML projects. Hence, the model choice is often dictated by the allocated budget approved by project stakeholders.
3. Performance Metrics
Evaluating machine learning models is a critical step in the ML development lifecycle. Different ML models require different evaluation metrics to monitor and evaluate their performance. For instance, some prominent ML tasks and their suitable evaluation metrics are given below.
| Machine Learning Task | Machine Learning Evaluation Metrics |
| --- | --- |
| Classification | Accuracy, Precision, Recall, F1-score, AUC-ROC |
| Regression | Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R-squared |
| Clustering | Silhouette Coefficient, Davies-Bouldin Index, Adjusted Rand Index |
| Natural Language Processing (NLP) | BLEU, ROUGE, METEOR, Perplexity |
| Computer Vision (CV) | Intersection over Union (IoU), Mean Average Precision (mAP), Pixel Accuracy |
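As a quick illustration, most of the classification metrics above are available in scikit-learn’s metrics module. The snippet below is a minimal sketch that computes a few of them on toy labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy binary-classification ground truth vs. predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
```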
4. Explainability
Many ML models operate as black boxes: data goes in, the model processes it, and an outcome comes out, with little visibility into how that outcome was reached.
Lately, machine learning explainability has become critical as models produce flawed outputs such as biased, hallucinated, or otherwise harmful and discriminatory content. A lack of explainability can render a model inappropriate for commercial use. Hence, the model selection process must evaluate how explainable a model’s outcomes are.
5. Complexity
Complex ML models can capture more details from large datasets but are difficult to maintain. Recently, we have witnessed a wave of language models with billions of parameters, e.g., DeepMind’s Gopher and Google’s PaLM, with 280 and 540 billion parameters, respectively. Such massive models are very hard to monitor and retrain (if needed).
For smaller datasets, complex models often result in overfitting. Hence, the complexity of the selected ML model must align with the complexity of the problem it is trying to solve.
Top ML Model Selection & Validation Techniques
Given the factors discussed above, how do you choose an optimal ML model? To answer this question, you must first understand which model selection techniques are available.
ML model selection techniques are categorized into two groups: probabilistic and resampling methods.
3 Prominent Probabilistic Techniques for Model Selection
Probabilistic model selection techniques score candidate models based on both their training performance and their complexity. All else being equal, a model with fewer parameters is less complex and receives a better score among the candidate models.
Some commonly used probabilistic techniques are:
1. Akaike Information Criterion (AIC)
The Akaike Information Criterion (AIC) measures the quality of a statistical model for a given dataset. It balances the trade-off between the goodness of the model’s fit on the training data and the complexity of the model. AIC penalizes models with more parameters, encouraging the selection of simpler models that still represent the training data well. A lower AIC value indicates a better-fitting model.
Formula:
AIC = 2k - 2ln(L)
k = the number of parameters
L = the maximized likelihood of the model, i.e., how well the model fits its training data
The AIC statistic is available in the Python statsmodels library. See the code snippet below.
```python
import numpy as np
from statsmodels.regression.linear_model import OLS

# Sample data
y = np.random.normal(size=10)
x1 = np.random.normal(size=10)
x2 = np.random.normal(size=10)
x3 = np.random.normal(size=10)

# Sample least squares models for comparison
model1 = OLS(y, x1)
result1 = model1.fit()
model2 = OLS(y, x2)
result2 = model2.fit()
model3 = OLS(y, x3)
result3 = model3.fit()

print("Model 1 AIC:", result1.aic)
print("Model 2 AIC:", result2.aic)
print("Model 3 AIC:", result3.aic)
```
Output:
Model 1 AIC: 30.327637041796276
Model 2 AIC: 29.47561019149594
Model 3 AIC: 28.87710902765963
If you want to manually calculate AIC in Python, plug this snippet into your code:
```python
# number of parameters
k = len(result1.params)
# log-likelihood value
lnL = model1.loglike(result1.params)

def AIC(k, lnL):
    # AIC = 2k - 2ln(L)
    return 2 * k - 2 * lnL

print("Model 1 AIC:", AIC(k, lnL))
```
2. Bayesian Information Criterion (BIC)
Derived from Bayesian probability and inference, the Bayesian Information Criterion (BIC) is a model selection statistic similar to AIC but with a stronger penalty for model complexity. BIC is particularly suitable for models trained using maximum likelihood estimation. More complex models receive a larger BIC score, which indicates a poorer model.
Formula:
BIC = -2ln(L) + k ln(N)
k = the number of parameters
L = the maximized likelihood of the model on the training data
N = the number of data points/samples
Like AIC, the BIC statistic is also available in the Python statsmodels library. See the code snippet below.
```python
import numpy as np
from statsmodels.regression.linear_model import OLS

# Sample data
y = np.random.normal(size=10)
x1 = np.random.normal(size=10)
x2 = np.random.normal(size=10)
x3 = np.random.normal(size=10)

# Sample least squares models for comparison
model1 = OLS(y, x1)
result1 = model1.fit()
model2 = OLS(y, x2)
result2 = model2.fit()
model3 = OLS(y, x3)
result3 = model3.fit()

print("Model 1 BIC:", result1.bic)
print("Model 2 BIC:", result2.bic)
print("Model 3 BIC:", result3.bic)
```
Output:
Model 1 BIC: 36.028481849183066
Model 2 BIC: 36.03658485115459
Model 3 BIC: 35.705662609637145
If you want to manually calculate BIC in Python, plug this snippet into your code:
```python
# number of parameters
k = len(result1.params)
# log-likelihood value
lnL = model1.loglike(result1.params)
# log of the number of samples
lnN = np.log(len(y))

def BIC(k, lnL, lnN):
    # BIC = -2ln(L) + k ln(N)
    return -(2 * lnL) + k * lnN

print("Model 1 BIC:", BIC(k, lnL, lnN))
```
3. Minimum Description Length (MDL)
The Minimum Description Length (MDL) method aims to find the model that best balances the complexity and goodness of the model’s fit by minimizing the total description length (in bits) of the model and the data it explains. It is based on the principle that the best model is the one that compresses the data the most.
Formula:
MDL = L(h) + L(D | h)
h = the model (hypothesis)
D = the data the model describes
L(h) = the number of bits required to represent the model
L(D | h) = the number of bits required to represent the data given the model, i.e., to encode the model’s prediction errors on the training dataset
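MDL has no single off-the-shelf implementation; how you encode the model and the errors is a design choice. The snippet below is a minimal, illustrative sketch under two assumptions: each parameter costs a fixed 32 bits for L(h), and L(D | h) is the Gaussian negative log-likelihood of the residuals converted into bits. It compares polynomial fits of increasing degree on noisy linear data, where the over-parameterized fit pays for its extra parameters:

```python
import numpy as np

def mdl_score(n_params, residuals, bits_per_param=32):
    # L(h): assume a fixed encoding cost per model parameter
    model_bits = n_params * bits_per_param
    # L(D | h): Gaussian negative log-likelihood of the residuals,
    # converted from nats to bits
    sigma = np.std(residuals) + 1e-12
    nll = (0.5 * len(residuals) * np.log(2 * np.pi * sigma**2)
           + np.sum(residuals**2) / (2 * sigma**2))
    data_bits = nll / np.log(2)
    return model_bits + data_bits

# Noisy linear data; compare polynomial models of increasing complexity
x = np.linspace(0, 1, 50)
y = 3 * x + np.random.normal(scale=0.1, size=50)
for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    print("degree=%d, MDL ~ %.1f bits" % (degree, mdl_score(degree + 1, residuals)))
```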
3 Prominent Resampling Techniques For Model Selection
Resampling methods estimate a model’s performance on unseen data by repeatedly drawing samples from the available dataset, training on one portion, and evaluating on the held-out remainder.
Some commonly used resampling techniques are:
1. Random or Time-Based Split
To generate new data samples, the available dataset can be split into multiple sets. The split can be random or time-based. The random split method randomly divides the data into training, testing, and validation sets. This procedure is repeated to check the model’s performance on numerous test sets and gauge its reliability.
Time-based split is usually done for data involving a time component, for instance, weather or stock market data. That’s because time-series data lacks mutual independence – one event can influence all subsequent data inputs. Hence, the training data is split based on a specific interval, like a week. The previous week would become the training set, and the following week would become the test set.
The code snippet below demonstrates a random split operation using Python’s sklearn library.
```python
from sklearn.model_selection import train_test_split

X_data = range(10)
y_data = range(10)

# Repeat the random split to observe variability across test sets
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X_data, y_data, test_size=0.3, random_state=None)
    print(y_test)
```
Output:
[4, 2, 8]
[1, 6, 0]
[3, 8, 9]
[1, 3, 5]
[5, 3, 2]
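For the time-based variant, scikit-learn’s TimeSeriesSplit is one readily available option: each split trains on earlier observations and tests on the period that follows, preserving temporal order. A minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten sequential observations, e.g., daily measurements
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# Each split trains on the past and tests on the following period
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    print("train:", train_idx, "test:", test_idx)
```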
2. Bootstrap
The bootstrap model selection technique assesses a model by resampling data points from the original dataset with replacement: it randomly draws data points to form a “bootstrapped sample,” returning each point to the pool after it is drawn so it can be selected again. The model is then trained on the bootstrapped sample and evaluated on the data points that were never drawn, known as the out-of-bag (OOB) sample.
The code snippet below demonstrates bootstrap sampling using Python.
```python
import random
import numpy as np

# Create a random distribution and calculate its mean
x = np.random.normal(loc=50, size=100)
print("Distribution mean:", np.mean(x))

# Use bootstrap sampling to estimate the mean of the distribution
bootstrapped_sample_means = []
# Number of bootstrapped samples
n = 50
# Sample size
k = 5

for i in range(n):
    # Draw a bootstrapped sample with replacement
    y = random.choices(x.tolist(), k=k)
    # Calculate the mean of the bootstrapped sample
    bootstrapped_sample_means.append(np.mean(y))

print("Mean of bootstrapped samples:", np.mean(bootstrapped_sample_means))
```
Output:
Distribution mean: 50.09707119894606
Mean of bootstrapped samples: 50.142366658259164
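The mean-estimation example above shows the sampling mechanics; for model selection, the same idea is to train on each bootstrapped sample and score on its out-of-bag rows. The snippet below is a minimal sketch of that loop, using scikit-learn’s resample helper and a toy classification dataset:

```python
import numpy as np
from sklearn.utils import resample
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=10)

oob_scores = []
for i in range(20):
    # Draw row indices with replacement to form the bootstrapped sample
    idx = resample(np.arange(len(X)))
    # Out-of-bag rows: the ones never drawn into the sample
    oob_idx = np.setdiff1d(np.arange(len(X)), idx)
    model = LogisticRegression().fit(X[idx], y[idx])
    oob_scores.append(accuracy_score(y[oob_idx], model.predict(X[oob_idx])))

print("Mean OOB accuracy:", np.mean(oob_scores))
```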
3. Cross-Validation
Cross-validation is one of the most commonly used resampling techniques for ML model selection. It comes in several variants, such as:
- K-Fold Cross Validation: It divides data into K subsets, training on K-1 folds and evaluating on the remaining fold. This process is repeated K times, each with a different fold held out for testing.
- Stratified K-Fold Cross Validation: It is a variation of the K-fold that preserves the class distribution in each fold. It ensures that each fold maintains a similar proportion of samples from each class, which is particularly useful when dealing with imbalanced datasets.
- Leave-one-out Cross-Validation (LOOCV): Each data sample in the original dataset is treated as a separate fold. In one iteration, the model is tested on one sample and trained on the rest. Hence, in each iteration, the test sample is changed. It provides a comprehensive evaluation but is computationally expensive.

The code snippet below demonstrates K-fold cross-validation using Python’s sklearn library.
```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# define a sample dataset
X, y = make_classification(n_samples=50, n_features=10)
# create a Logistic Regression model
model = LogisticRegression()
# prepare the cross-validation method
cv = KFold(n_splits=10, random_state=1, shuffle=True)
# evaluate the model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy (k=10): %.3f' % mean(scores))
```
Output:
Accuracy (k=10): 0.880
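Swapping in the other variants described above is straightforward: the cross-validator object passed as cv is the only thing that changes. A minimal sketch with StratifiedKFold and LeaveOneOut (the latter gets expensive quickly on larger datasets):

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=50, n_features=10)
model = LogisticRegression()

# Stratified K-fold: preserves class proportions in every fold
strat_cv = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
print('Stratified accuracy: %.3f' % mean(cross_val_score(model, X, y, cv=strat_cv)))

# LOOCV: one held-out sample per iteration
loo_cv = LeaveOneOut()
print('LOOCV accuracy: %.3f' % mean(cross_val_score(model, X, y, cv=loo_cv)))
```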
Streamline Your ML Model Selection Process To Maximize Performance
ML model selection presents significant challenges. As this article has shown, a multitude of methods is available for model selection. How can you choose the best one?
You can experiment with multiple techniques and try to interpret their results. Or you can talk to our experts at Deepchecks. They can show you how to validate each aspect of your ML lifecycle by continuously testing models in pre-production and production environments.
Deepchecks offers a range of features for all your AI and ML validation requirements, including model testing, CI/CD, and monitoring. Explore Deepchecks’ open-source library today to start optimizing your machine learning workflows.