Model Confidence and How it Helps Model Validation

This blog post was written by Tonye Harry as part of the Deepchecks Community Blog. If you would like to contribute your own blog post, feel free to reach out to us via blog@deepchecks.com. We typically pay a symbolic fee for content that's accepted by our reviewers.


Typically, machine learning algorithms sift through mountains of data in order to discover patterns and/or predict future outcomes. These predictions are almost never perfect since they are based on probabilities rather than certainties. Confidence in a machine learning model is crucial in high-stakes applications such as detecting fraudulent transactions, diagnosing cancer in a patient, making financial decisions that separate profit from bankruptcy, speech recognition, facial recognition, and a multitude of others.

Essentially, we need to know how to ascertain that our machine learning model does what we want it to accomplish, and how reliable its predictions are.

In this article, we discuss the concept of model confidence, the model accuracy metrics that can influence it, the use-cases these metrics apply to, when to use them, and how confidence and accuracy are examined in model validation.

So let’s dive right in!

What is model confidence?

Model confidence, often confused with accuracy, indicates how likely it is (as a probability) that the predictions of a machine learning algorithm are correct. It indicates how well the model is performing toward achieving its goal. Model accuracy, on the other hand, is the percentage of predictions the model gets right for a certain use-case. Model confidence is usually measured with a confidence level and, although not often used in model validation, remains critical to the overall process.

Confidence Level

A Confidence Level is the probability that a model gets to (or is close to) an estimated prediction every time it is used. This is frequently expressed as a single number (confidence coefficient) or as a range of percentages (confidence interval) between 0 and 100%. A confidence interval measures the level of certainty of an estimate, given a lower and upper bound alongside a probability.

The components of a confidence interval:

  • Range depicts the expected skill of the model with a lower and upper bound to indicate the minimum and maximum skill levels, respectively.
  • Probability is how likely the model skill will fall within the range.

Consider a classification model whose estimated accuracy is 82%. Using a confidence interval, you may infer that the true model accuracy lies between 80% and 85%, with a 95% likelihood. In other words, there is a 95% probability that the range of 80-85% covers the accuracy the model actually achieves when it generalizes to new data for the use-case.
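
One common way to obtain such an interval is the normal approximation to the binomial distribution. The sketch below assumes a hypothetical validation set of 1,000 samples (the article does not state a sample size) and the standard z-value of 1.96 for a 95% confidence level; the function name is illustrative.

```python
import math

def accuracy_confidence_interval(accuracy, n_samples, z=1.96):
    """Normal-approximation interval for a classification accuracy."""
    # Standard error of a proportion estimated from n_samples independent trials.
    std_error = math.sqrt(accuracy * (1 - accuracy) / n_samples)
    margin = z * std_error
    # Clip to [0, 1] because accuracy is a proportion.
    return max(0.0, accuracy - margin), min(1.0, accuracy + margin)

# Assumed example: 82% accuracy measured on 1,000 validation samples.
lower, upper = accuracy_confidence_interval(0.82, 1000)
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```

With these assumed numbers the interval comes out at roughly 79.6% to 84.4%. The interval narrows as the validation set grows, so the same 82% measured on fewer samples would come with a wider range.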

A confidence interval can also be used to present errors in a model. These intervals guide practitioners in the model selection process when comparing models. Stakeholders can identify the level of certainty they require (usually 95%), setting expectations for the output value of the accuracy metrics being used.

To get a practical view of this concept, take a look at confidence intervals for ML by Jason Brownlee.


  • Confidence levels allow us to weigh the outcomes of a model’s predictions. Because they come with a range, they enable quick troubleshooting: if the model’s performance moves below or above the interval, that is a signal to investigate the ML system.
  • Accuracy and confidence are independent of each other, so during model training and selection, confidence levels can be added to the criteria to reduce the likelihood of model overfitting on the validation set.

Model Accuracy

Model accuracy can be determined by a variety of metrics, but choosing one for your situation should be done with care. As a rule of thumb, whenever you measure accuracy, use more than one metric; a single metric may give us information, but it does not tell the whole story.

A metric should be selected based on the nature of the problem to provide the greatest benefit in practice. Due to the diversity of use-cases and datasets, there is no one-size-fits-all metric. The majority of machine learning algorithms fall into one of two categories: classification or regression.

Metrics for Classification Models

Classification models categorize datasets into different classes (discrete output) and are used in applications such as facial recognition, fraud detection, speech recognition, handwriting recognition, and document classification. A classification model’s performance can be evaluated using various metrics during training, testing, or deployment.


Confusion Matrix

The confusion matrix is used to evaluate the performance of a machine learning model by comparing the target values with the model’s predictions. It summarizes the results of a classification algorithm, showing what the model gets right and what types of errors it is making.

In the confusion matrix, rows represent the instances in predicted classes, and columns represent actual classes.

The confusion matrix isn’t really a performance metric itself, but a basis from which other metrics such as accuracy, precision, and recall can be computed. The confusion matrix for a binary classification problem is shown below:

Fig. 1: A confusion matrix. Source

Let’s take a closer look at the matrix, where a machine learning model is deployed to detect liver disease.

True Positive (TP) depicts the number of positive samples the model predicted correctly. These are cases where the model predicted that a patient has liver disease and the patient does have it.

True Negative (TN) depicts the number of negative samples the model predicted correctly. Specifically, the model predicted that the patient does not have liver disease and the patient actually doesn’t have it.

False Positive (FP) depicts the number of samples the model incorrectly predicted as positive. The model predicted that a patient has liver disease, but the patient does not actually have it. This is also called a Type I error.

False Negative (FN) depicts the number of samples the model incorrectly predicted as negative. The model predicted that the patient does not have liver disease, but the patient actually has it. This is also known as a Type II error.
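
To make the four outcomes concrete, here is a minimal sketch that counts them by hand for ten hypothetical patients (1 = has liver disease, 0 = healthy). The labels and the function name are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # sick, flagged as sick
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # healthy, flagged as healthy
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # healthy, flagged as sick (Type I)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # sick, flagged as healthy (Type II)
    return tp, tn, fp, fn

# Hypothetical labels for ten patients.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

For these labels the counts are 3 true positives, 4 true negatives, 2 false positives, and 1 false negative, which together account for all ten patients.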

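A confusion matrix can be generated after training your model using the sklearn library by running the code segment below. Note that sklearn places the actual classes in rows and the predicted classes in columns, which is the transpose of the layout described above:

from sklearn.metrics import confusion_matrix
# ground_truth: actual labels; model_predictions: labels predicted by the trained model
conf_mat = confusion_matrix(ground_truth, model_predictions)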


Accuracy

Accuracy is probably the most intuitive classification metric, computed as the ratio of correct predictions to the total number of predictions. In our liver disease use-case, accuracy answers how many patients were correctly classified across the entire dataset.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Accuracy is computed by comparing the ground truth values to the predicted values, using either the confusion matrix or accuracy_score from the sklearn library.

Accuracy is useful but does not reliably reflect performance, especially when datasets are unbalanced. For example, suppose that in a dataset of 100 patients, 10 have liver disease and 90 are healthy. An accuracy of 90% after training a machine learning model on this dataset may simply indicate that it is good at detecting healthy patients: a model that labels every patient as healthy would score exactly 90% while missing every sick patient. For this model to be effective, it needs to ensure that patients who are actually suffering from the disease are not overlooked. In light of this, alternative metrics are necessary.
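
The 100-patient example can be reproduced in a few lines. This sketch builds the dataset described above and scores a deliberately useless classifier that calls everyone healthy; the variable names are illustrative.

```python
# 10 patients with liver disease (1) and 90 healthy patients (0).
actual = [1] * 10 + [0] * 90

# A trivial "model" that predicts healthy for every patient.
always_healthy = [0] * len(actual)

correct = sum(1 for a, p in zip(actual, always_healthy) if a == p)
accuracy = correct / len(actual)

# Number of sick patients the model actually caught.
sick_detected = sum(1 for a, p in zip(actual, always_healthy) if a == 1 and p == 1)

print(f"accuracy={accuracy:.2f}, sick patients detected={sick_detected}")
```

The accuracy is 0.90 even though none of the ten sick patients is detected, which is why accuracy alone is a poor guide on unbalanced data.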


Precision

Precision is the ratio of true positives to the total number of positive predictions. It attempts to answer how likely the model is to be correct when it makes a positive prediction. To put it another way: when the model says a patient has the disease, how often is it right?

$$\text{Precision} = \frac{TP}{TP + FP}$$

Precision ensures we don’t misclassify too many people as having the disease when they don’t. Consequently, fewer patients undergo treatment for a disease they don’t have.

Recall or Sensitivity

Recall is the ratio of true positives to the total number of actual positive instances in the ground truth. It attempts to answer whether a classifier can detect a positive instance when given a positive example.

$$\text{Recall} = \frac{TP}{TP + FN}$$

In this way, recall ensures we don’t overlook those who have the disease. This prevents us from predicting a person does not have a disease when in fact they do.
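
Given the counts from a confusion matrix, precision and recall are each a one-line computation. The sketch below uses hypothetical counts (3 true positives, 2 false positives, 1 false negative); the helper names are illustrative and are not sklearn functions.

```python
def precision_from_counts(tp, fp):
    """Share of positive predictions that were actually positive."""
    # Return 0.0 when the model made no positive predictions at all.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall_from_counts(tp, fn):
    """Share of actual positives that the model found."""
    # Return 0.0 when the ground truth contains no positives.
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts for a liver disease classifier.
tp, fp, fn = 3, 2, 1

precision = precision_from_counts(tp, fp)
recall = recall_from_counts(tp, fn)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here precision is 0.60 (two of the five flagged patients were healthy) while recall is 0.75 (one of the four sick patients was missed), showing that the two metrics penalize different kinds of error.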

F1 Score

The F1 score is the harmonic mean of precision and recall. It combines the two metrics into a single value that is high only when both precision and recall are high.

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Despite its benefits, the F1 score has several disadvantages: a low F1 score does not tell you whether precision, recall, or both are the problem, and it treats precision as equally important as recall. There may be instances where the F1 score isn’t the best metric to use, such as when you wish to weight one metric more heavily than the other. In those cases it may be more appropriate to use a weighted F1 Score, a PR Curve, or an ROC Curve.
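
One common way to weight precision and recall differently is the F-beta score, where beta greater than 1 favors recall and beta less than 1 favors precision. Whether this is exactly what is meant by a weighted F1 score here is an assumption, so treat the sketch as one possible reading. The precision and recall values are hypothetical.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the ordinary F1 score."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator

# Hypothetical classifier with precision 0.60 and recall 0.75.
precision, recall = 0.6, 0.75

f1 = f_beta(precision, recall)             # equal weight
f2 = f_beta(precision, recall, beta=2.0)   # recall counts more
f_half = f_beta(precision, recall, beta=0.5)  # precision counts more
print(f"F1={f1:.3f}, F2={f2:.3f}, F0.5={f_half:.3f}")
```

Because recall is the higher of the two values in this example, the recall-weighted F2 (about 0.714) comes out above F1 (about 0.667), and the precision-weighted F0.5 (0.625) below it. In a disease-screening setting where missed cases are costly, a beta above 1 would be the natural choice.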

Precision-Recall (PR) Curve

The PR Curve shows the tradeoff between precision and recall for various threshold values. Ideally, we want the curve to reach the top right corner of the graph, where we get both high precision (few false positives) and high recall (few false negatives). The PR Curve is useful for classification on an unbalanced dataset.

Fig. 2: precision-recall curve with unbalanced data. Source
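
Each point on a PR curve comes from choosing a threshold, turning the model scores into hard predictions, and computing precision and recall. The sketch below does this for three thresholds on ten hypothetical samples; the labels, scores, and function name are made up for illustration.

```python
def precision_recall_at(threshold, labels, scores):
    """Precision and recall when scores >= threshold are predicted positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    # By convention, precision is taken as 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical ground truth and model scores (higher = more likely positive).
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]

curve = {t: precision_recall_at(t, labels, scores) for t in (0.25, 0.5, 0.75)}
for threshold, (precision, recall) in curve.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Lowering the threshold from 0.75 to 0.25 raises recall from 0.50 to 1.00 but drops precision from about 0.67 to 0.50, which is the tradeoff the curve visualizes.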

ROC and AUC Score

A Receiver Operating Characteristic (ROC) Curve summarizes the performance of the model on the positive class. It plots the tradeoff between the true positive rate and the false positive rate for various threshold values. The area under the curve is called the ROC AUC (Area Under the Curve); the higher the value, the better. Here is an example of an ROC Curve:

Fig. 3: ROC curve plotting true and false positives. Source
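
The ROC AUC has a useful interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way, by comparing every positive/negative pair, on hypothetical labels and scores. It is meant to show the idea, not to replace a library implementation.

```python
def roc_auc_pairwise(labels, scores):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0   # pair ranked correctly
            elif pos_score == neg_score:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical ground truth and model scores (higher = more likely positive).
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]

auc = roc_auc_pairwise(labels, scores)
print(f"ROC AUC = {auc:.3f}")
```

In this example 20 of the 24 positive/negative pairs are ranked correctly, giving an AUC of about 0.833. A value of 0.5 would correspond to random ranking and 1.0 to a perfect one.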

The classification metrics in this section can be computed using the sklearn library as shown in the code fragment below. The ground truth is derived from your dataset, and model predictions are from the trained model.

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

# Toy labels: ground truth comes from your dataset, predictions from the trained model
ground_truth = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1]
model_predictions = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

# Get confusion matrix (rows: actual classes, columns: predicted classes)
conf_matrix = confusion_matrix(ground_truth, model_predictions)
# Compute accuracy
accuracy = accuracy_score(ground_truth, model_predictions)
# Get classification report, i.e. precision, recall and F1 score per class
class_report = classification_report(ground_truth, model_predictions)
# Get ROC AUC score; in practice pass predicted probabilities or scores
# (e.g. from predict_proba) rather than hard 0/1 labels
roc_auc = roc_auc_score(ground_truth, model_predictions)

Metrics for Regression Models

Regression models describe the relationship between one dependent variable and one or more independent variables, and can be applied to a myriad of applications from predicting sales to weather forecasting. Unlike the discrete output in classification, the output of regression models is continuous. Hence, we need different types of metrics to compute the performance of these models.

Mean Absolute Error (MAE)

The MAE is the mean of the absolute differences between the actual and predicted values in a dataset. It gives us an idea of how wrong the model predictions are, as the score increases linearly with an increase in error. Mathematically, it is represented as:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| Y_i - \hat{Y}_i \right|$$

where $Y_i$ are the actual values, $\hat{Y}_i$ are the predicted values, and $n$ is the number of samples.

Getting the perfect MAE score of 0 means all predictions are correct, which is nearly impossible. To gauge your MAE score, use a simple predictive model to establish a baseline MAE, then test your model against this baseline to see whether it performs better than the simple model.
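
A minimal sketch of such a baseline comparison is shown below. It assumes the "simple predictive model" is one that always predicts the mean of the actual values, which is a common choice but not the only one; the data and function name are hypothetical.

```python
def mean_absolute_error_manual(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical regression targets and model predictions.
actual = [3, 5, 2, 8, 7]
model_predictions = [2.5, 5.5, 2.0, 7.0, 8.0]

# Baseline (assumption): always predict the mean of the actual values.
mean_value = sum(actual) / len(actual)
baseline_predictions = [mean_value] * len(actual)

baseline_mae = mean_absolute_error_manual(actual, baseline_predictions)
model_mae = mean_absolute_error_manual(actual, model_predictions)
print(f"baseline MAE={baseline_mae:.2f}, model MAE={model_mae:.2f}")
```

Here the model reaches an MAE of 0.60 against a baseline of 2.00, so it is clearly adding value over the naive predictor. In practice the baseline would normally be fitted on training data rather than on the values being scored.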

Mean Square Error (MSE)

The MSE is the mean of the squared differences between the actual and predicted values in a dataset.

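$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2$$

where $Y_i$ are the actual values, $\hat{Y}_i$ are the predicted values, and $n$ is the number of samples.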

The squaring magnifies large errors, so a model that makes a few large errors ends up with a much higher average error score than one that makes many small errors. Just like MAE, a good MSE is relative to your specific dataset and a baseline should be established first.
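
The effect of squaring is easiest to see with two hypothetical models that have the same MAE but distribute their errors differently. All values and names below are made up for illustration.

```python
def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [10, 10, 10, 10]

# Model A is off by 2 on every sample.
model_a = [12, 12, 12, 12]
# Model B is perfect on three samples and off by 8 on the fourth.
model_b = [10, 10, 10, 18]

mae_a, mse_a = mae(actual, model_a), mse(actual, model_a)
mae_b, mse_b = mae(actual, model_b), mse(actual, model_b)
print(f"Model A: MAE={mae_a}, MSE={mse_a}")
print(f"Model B: MAE={mae_b}, MSE={mse_b}")
```

Both models have an MAE of 2.0, but the MSE of model B (16.0) is four times that of model A (4.0) because its single large error is squared. MSE is therefore the better choice when large misses are disproportionately costly.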

Root Mean Squared Error (RMSE)

The MSE value is often confusing to report because it is expressed in the squared units of the target variable rather than in the units of the target itself. The RMSE, an extension of the MSE, is usually preferred since it solves this problem. It is simply the square root of the MSE.

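$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2}$$

where $Y_i$ are the actual values, $\hat{Y}_i$ are the predicted values, and $n$ is the number of samples.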

Much like in MSE, a good RMSE is relative to your specific dataset and a baseline should be established first.

R Squared (R²) – Coefficient of Determination

R² is a statistical measure of the proportion of variance in the actual values that is explained by the model’s predictions. It indicates the goodness of fit of the predicted values to the actual values.

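$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2}{\sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2}$$

where $Y_i$ are the actual values, $\hat{Y}_i$ are the predicted values, and $\bar{Y}$ is the mean of the actual values.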

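An R² of 1 indicates a perfect fit, while an R² of 0 means the model does no better than always predicting the mean of the actual values. R² can even be negative when the model does worse than that. The closer the value of R² is to 1, the better the model fits.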
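
The three regimes of R² (close to 1, exactly 0, and negative) can be checked with a small from-scratch computation. The data, the three sets of predictions, and the function name are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [3, 5, 2, 8, 7]

good_predictions = [2.5, 5.5, 2.0, 7.0, 8.0]   # close to the actual values
mean_predictions = [5.0] * len(actual)          # always predict the mean (5.0)
bad_predictions = [8, 2, 7, 3, 4]               # worse than predicting the mean

r2_good = r_squared(actual, good_predictions)
r2_mean = r_squared(actual, mean_predictions)
r2_bad = r_squared(actual, bad_predictions)
print(f"good={r2_good:.3f}, mean={r2_mean:.3f}, bad={r2_bad:.3f}")
```

The good predictions score about 0.904, the mean predictor scores exactly 0, and the bad predictions score about -2.577, so a negative R² is a clear sign that a model is worse than the simplest possible baseline.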

Just like the classification metrics, sklearn also provides regression metrics that we can take advantage of as shown in this code:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

ground_truth = [6, 3, 0, -5, -1, 12, -8, 9, -3, 1]
model_predictions = [6.5, 0.2, -2.3, 3.6, -0.7, 5.0, 8.2, 12.1, 16.4, 7.0]

mae = mean_absolute_error(ground_truth, model_predictions)
mse = mean_squared_error(ground_truth, model_predictions)
# RMSE is the square root of the MSE; computing it directly avoids the
# squared=False argument, which is deprecated in recent sklearn releases
rmse = mse ** 0.5
r2 = r2_score(ground_truth, model_predictions)

Model validation

In validating a model, a practitioner confirms whether the model output is acceptable with respect to the ground truth dataset. A subset of your dataset can be held out for validation before training your machine learning model. You can use the metrics discussed above to observe how the model performs on the validation set during training. When the model’s performance is satisfactory, you can stop training it and then test it on new data it has never seen before to see how it does.

After the accuracy or performance of the model is calculated, it should be reported together with a confidence interval for the value obtained. Fundamentally, the interval answers how the model’s performance is expected to vary across repeated executions. This can be used to validate a model during the model selection process: instead of using only an accuracy metric to select the best model, confidence intervals can be introduced to give another layer of assurance.

For example, suppose different classification models are trained and evaluated on the validation set, giving varying accuracy scores. In a typical model selection methodology, it makes sense to pick the one with the highest accuracy. When confidence intervals are introduced, they additionally show how certain each accuracy estimate on the validation set is. This gives the practitioner another layer of trust that the model can be used for the task.
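
One way to put this into practice is a percentile bootstrap: resample the validation set with replacement many times, recompute accuracy each time, and take the middle 95% of the results as the interval. The sketch below is an assumption about how such a comparison could be done, not the method from the paper referenced in this article; the two models, their predictions, and all names are hypothetical.

```python
import random

def bootstrap_accuracy_interval(actual, predicted, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for accuracy on a validation set."""
    rng = random.Random(seed)
    # 1 where the prediction matches the ground truth, 0 otherwise.
    correct = [int(a == p) for a, p in zip(actual, predicted)]
    n = len(correct)
    accuracies = []
    for _ in range(n_resamples):
        # Resample the validation set with replacement and re-score it.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        accuracies.append(sum(resample) / n)
    accuracies.sort()
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return accuracies[lower_index], accuracies[upper_index]

# Hypothetical validation set of 200 samples (all labeled 1 for simplicity).
actual = [1] * 200
model_a = [1] * 170 + [0] * 30   # 85% accurate
model_b = [1] * 164 + [0] * 36   # 82% accurate

interval_a = bootstrap_accuracy_interval(actual, model_a)
interval_b = bootstrap_accuracy_interval(actual, model_b)
intervals_overlap = interval_a[0] <= interval_b[1] and interval_b[0] <= interval_a[1]
print(interval_a, interval_b, intervals_overlap)
```

If the two intervals overlap, the gap in accuracy between the models may be within the noise of the validation set, so the higher score alone is weak evidence for preferring one model over the other.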

With the insight from the confidence level, teams can modify the model to seek better performance in training and production. It can help practitioners select the best models based on accuracy that have been tested on different scenarios. You can read in detail how this is done from Mikel Petty’s paper on Calculating and Using Confidence Intervals for Model Validation.

Instead of doing this manually with your team’s agreed metrics, remember that you can use an ML tool like Deepchecks to thoroughly validate ML models with very little effort.

To explore all the checks and validations in Deepchecks, go try it yourself! Don’t forget to ⭐ their GitHub repo – it’s a big deal for open-source-led companies.
