
Intro to Model Performance Metrics

In this blog post, we'll walk you through the motivation for using model performance metrics, what metrics are, and which metrics are commonly used in different cases.

Why do we need ML model metrics?

Say you want to use an ML model to solve a problem. You are probably going to ask yourself:

  • What degree of performance is considered good enough to adopt the model?
  • Which model should I choose? Which one is better than the others?
  • Did the performance degrade over time?

After considering these points, you’ll need to communicate those resolutions to your stakeholders (colleagues and clients). When they ask how well a model performs, it would benefit them to receive an answer that’s more informative than just “good.” The assessment of how well the model performs drives informed decision making regarding the usage of that model.

Metrics allow us to better understand how well the model is doing without diving into individual samples and details.

What is a metric?

A metric is a function that quantifies the model’s performance into a single score.

Ideally, a metric approximates how well the model would perform in the real world. While it’s only calculated on a certain subset of the data, the score on this subset is normally used as a proxy of the performance on the data that it will encounter in the future.

What is a good model performance metric?

Metrics are meant to drive decision making. For a given use-case, measure the criteria that represent the key factors in the decision making process.

There is no single criterion for measuring how good a metric is, but for it to be useful, the following should generally apply:

  1. Deterministic. A reliable metric is reproducible, meaning that given the same data and the same model, the metric will always return the same result. A metric cannot rely on personal opinions.
  2. Comparable. The result has to be ordinal, so we can determine whether the current result is better than the previous one, and on a clear scale, so we can also tell how much better it is.
  3. Explainable. It should be easy to communicate the results to other stakeholders and for them to form an impression of what the results mean.
  4. Resonates with domain/business sense. This is too often overlooked. We have to make sure that we are measuring the right thing: the level of success in the actual problem the model is meant to solve. For example, if we use object detection for an inventory count of a certain product, we care more about how many targets are detected than about whether the model found the exact location of each product (IoU).

Object detection for inventory count by Sol Yarkoni.

Other considerations for choosing a metric might include:

  1. Using an academic or industry standard.
  2. Time for implementation (we might prefer a metric that can be used off-the-shelf).
  3. Computation time and memory.

How are metrics different from loss functions?

| Metrics | Loss functions |
| --- | --- |
| Used for model evaluation. | Used for model optimization. |
| Need to satisfy some mathematical properties, but are less strict than loss functions. | Used as objectives in optimization problems, so they must satisfy properties that metrics don't, such as differentiability. |
| Typically computed once at the end of the process (except in specific cases like early-stopping conditions). | Computed many times during the optimization, so calculation speed is crucial. |
| Face outward (presenting the results to the user); explainability is crucial. | Face inward (the results are returned to the optimizer, the entity managing the optimization process). |
| Often, a larger result means better performance. | A smaller result means better performance. |
| Aggregative (meaningless for a single sample). | Can be applied to a single sample to derive the optimization step. |

Many functions can be used as both a metric and a loss function. In such cases, an adjustment for the direction (minimize the loss or maximize the metric), such as a minus sign, is commonly used.
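As a rough sketch of this sign flip (assuming scikit-learn is installed; the dataset and model below are placeholders, not part of the original post), scikit-learn exposes MSE for model selection as "neg_mean_squared_error" so that the higher-is-better convention of scores still holds:

```python
# A rough sketch, assuming scikit-learn is installed; the data and model
# here are placeholders for illustration.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# MSE is minimized as a loss, so scikit-learn exposes it for model selection
# as "neg_mean_squared_error": higher (less negative) scores are better.
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
print(scores)           # negative values: -MSE per fold
print(-scores.mean())   # flip the sign back to read it as a plain MSE
```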

Common Metrics

Regression

  • MSE (Mean Squared Error). The average squared difference between the estimated values and the actual values. The most common metric for regression.
  • RMSE/RMSD (Root Mean Square Error/Deviation). The square root of MSE. Also commonly used.
  • MAE (Mean Absolute Error). The average absolute difference between the estimated values and the actual values.
  • Cosine Similarity. The cosine of the angle between two sequences or vectors of numbers. It depends only on the angle between the vectors, not on their magnitudes; similar vectors point in the same direction.
  • R2/Coefficient of Determination. The fraction of the variance of the ground truth that is explained by the model's predictions. Can be viewed as how much of the variation in the target the model accounts for.
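As a minimal sketch (assuming scikit-learn and NumPy are available), the regression metrics above can be computed as follows; the values of y_true and y_pred are purely illustrative:

```python
# A minimal sketch, assuming scikit-learn and NumPy; y_true and y_pred
# are illustrative values only.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is simply the square root of MSE
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# Cosine similarity between the two vectors of values
cos_sim = np.dot(y_true, y_pred) / (np.linalg.norm(y_true) * np.linalg.norm(y_pred))

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f} cosine={cos_sim:.3f}")
```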

Classification

Generally, the common classification metrics are based on the concept of comparing the predicted label with the ground truth label and counting the matches and mismatches between them.

This is usually done with the assistance of a confusion matrix. If you are not familiar with the concept of confusion matrix, check out this link.

| | Actual Positive | Actual Negative |
| --- | --- | --- |
| Predicted Positive | True Positive (TP) | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN) |

Confusion Matrix illustration by Sol Yarkoni.
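Here is a minimal sketch (assuming scikit-learn is installed) of building such a confusion matrix from label lists; the labels are illustrative only:

```python
# A minimal sketch, assuming scikit-learn; the labels are illustrative.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # 1 = positive, 0 = negative
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Note: scikit-learn puts actual labels on the rows and predictions on the
# columns (the transpose of the illustration above). With labels=[1, 0]
# the matrix reads [[TP, FN], [FP, TN]].
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
(tp, fn), (fp, tn) = cm
print(cm)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```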

  • Accuracy. The number of correctly classified samples out of the total number of samples. Very intuitive but often misleading, especially for imbalanced data. Read more about accuracy in this post.
  • Precision. The number of samples correctly classified as positive out of the total number of samples classified as positive. Can be regarded as the fraction of relevant samples out of the samples spotted by the model.
  • Recall / Sensitivity / TPR. The number of samples correctly classified as positive out of the total number of positive samples. Can be regarded as the fraction of the relevant samples that were spotted by the model.

Recall = TP / (TP + FN)

  • Specificity/ TNR. The number of samples correctly classified as negative out of the total number of negative samples. Complementary to sensitivity.

Specificity = TN / (TN + FP)

  • F-1. Combines precision and recall into one metric by taking their harmonic mean. More robust to class imbalance than accuracy.
  • AUC. The area under the curve of the ROC graph. For the previous classification metrics, a threshold on the model output is chosen, above which a sample is classified as positive and below which it is classified as negative. The AUC takes all possible thresholds into account.
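A minimal sketch (assuming scikit-learn is installed) of computing these classification metrics; the labels and scores below are illustrative only:

```python
# A minimal sketch, assuming scikit-learn; labels and scores are illustrative.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                     # hard labels at a fixed threshold
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]    # raw model scores / probabilities

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
# AUC is computed from the raw scores, so it accounts for every possible threshold
print("auc      :", roc_auc_score(y_true, y_score))
```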

Read more about classification metrics in this post.

Object Detection

  • IoU/Jaccard Index. The ratio between the area of overlap of the predicted and actual bounding boxes and the area of their union. The most intuitive metric for object detection.
  • mAP (Mean Average Precision). The mean of the average precision per class over the classes. Calculated at a certain IoU threshold, usually 0.5. Commonly used for benchmarking object detection models. A good explanation of how it is calculated can be found here.
  • mAR (Mean Average Recall). The mean of the average recall per class over the classes, averaged over IoU thresholds in the range [0.5, 1].
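As a minimal sketch of IoU, assuming axis-aligned boxes in the common (x1, y1, x2, y2) corner format (the example boxes are illustrative):

```python
# A minimal sketch of IoU for axis-aligned boxes, assuming the
# (x1, y1, x2, y2) corner format; the example boxes are illustrative.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle (zero area if the boxes don't overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```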

Multi-class Aggregation

As demonstrated earlier, the common classification metrics are based on the confusion matrix.

The terms True Positives, True Negatives, False Positives, and False Negatives are clear in the binary case when a sample is either positive or negative. But how do we calculate the confusion matrix and the metrics derived from it when there are more than 2 classes?

For example, the goal of MNIST is to classify an image of a handwritten digit to one of the digits 0-9, so the classes are {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} instead of {positive, negative}.


The confusion matrix for an imaginary model on MNIST would be a 10×10 table, with one row per predicted digit, one column per actual digit, and each cell counting the samples that fall into that combination.

To use binary metrics for multiple classes, we split the problem into multiple binary problems.

The two methods are:

  • One Vs. All (OVA) or One Vs. Rest (OVR). For each class, we separate the samples into 2 groups: belonging to that class, or belonging to any other class. In this case, the metric returns a result per class, so the number of results is equal to the number of classes.
  • One Vs. One (OVO). We check each class against the other classes one by one: belonging to the first class, or to the second class. In this case, the metric returns a result per class pair. Assuming that the metric is symmetric, the number of results for N classes is N(N - 1) / 2.

After the metrics per class or class pair are calculated, they can be presented as is or aggregated into a single number. The common averaging (aggregation over the classes) methods are listed below.

To demonstrate, we’ll take the formula for recall per class:

Recall_i = TP_i / (TP_i + FN_i),  i = 1, ..., C

where C is the number of classes and TP_i and FN_i are the true positive and false negative counts for class i.

  • Micro. The metric is calculated from the total counts of the TP, FP, etc., from all of the samples regardless of the class they belong to.

For example, micro-averaged recall is: Total Recall (micro) = (TP_1 + ... + TP_C) / ((TP_1 + FN_1) + ... + (TP_C + FN_C))

  • Macro. First calculate the metric per class, and then take the average between the class results.

Total Recall (macro) = (Recall_1 + Recall_2 + ... + Recall_C) / C

  • Weighted (macro). First calculate the metric per class, and then take the weighted average between the class results. The weights are usually the class frequency, so that classes with more samples have a stronger effect on the total result.

Total Recall (weighted) = w_1 · Recall_1 + w_2 · Recall_2 + ... + w_C · Recall_C, where w_i is the fraction of samples belonging to class i.
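A minimal sketch (assuming scikit-learn is installed) of the three averaging strategies applied to recall on illustrative 3-class labels:

```python
# A minimal sketch, assuming scikit-learn; illustrative 3-class labels.
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 0, 2, 2, 2, 1]

print("per class:", recall_score(y_true, y_pred, average=None))        # one result per class
print("micro    :", recall_score(y_true, y_pred, average="micro"))     # pooled TP/FN counts
print("macro    :", recall_score(y_true, y_pred, average="macro"))     # plain mean over classes
print("weighted :", recall_score(y_true, y_pred, average="weighted"))  # mean weighted by class frequency
```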

Choosing the Test Dataset

This post is about metrics, but it’s worth mentioning the data that the metrics run on.

Some general characteristics of a good dataset include:

  • Only unseen samples; samples that didn’t appear in the train set. To assess how well the model actually learned, we need to eliminate data that it could have memorized during the training process.
  • Large enough to yield statistically meaningful results.
  • The data distribution should be as similar as possible to the real world. Since the metrics are aggregative, they’re affected by the underlying distribution of the data.
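As a minimal sketch of these points (assuming scikit-learn is installed; the synthetic dataset is just a placeholder), a stratified hold-out split keeps the test set unseen while preserving the class distribution:

```python
# A minimal sketch, assuming scikit-learn; the synthetic dataset is a placeholder.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1],
                           random_state=0)

# stratify=y keeps the test set's class distribution close to the full data's;
# the held-out samples are never shown to the model during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```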

Read more about test sets in this post.

Using ML Model Performance Metrics with Deepchecks

Want to use metrics to evaluate the performance of your model on your data?

Check out our metrics guide!
