Would you like to know how to measure the accuracy of your machine learning model? Do you wonder if high accuracy automatically means good performance? Are you trying to measure accuracy on multiclass and multilabel problems?

Accuracy is perhaps the best-known machine learning model validation method used in classification problems. One reason for its popularity is its relative simplicity. It is easy to understand and easy to implement. Accuracy is a good metric to assess model performance for simple cases.

However, modeling problems are rarely simple. You need to work with imbalanced datasets or in a multiclass or multilabel setting. A high accuracy might not even be your goal. As you solve more complex machine learning problems, calculating and using accuracy becomes less obvious and requires extra consideration.

For this reason, it is important to understand what accuracy is, how to calculate it, and what its weaknesses are in different machine learning contexts.

This article gives you an overview of accuracy as a classification metric. It explains its definition, shows you how to use it in a binary, multiclass, and multilabel context, and identifies its main issues.

You can find the full code behind the examples here.

Let’s start with a simple definition.

## What is Accuracy?

Accuracy is a classification metric that gives the percentage of correct predictions. We calculate it by dividing the number of correct predictions by the total number of predictions.

$$\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$

This formula provides an easy-to-understand definition that assumes a binary classification problem. (We discuss multiclass and multilabel problems in the second part of this article.)

In the binary classification case, we can express accuracy in True/False Positive/Negative values.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Where

- *TP*: True Positives
- *FP*: False Positives
- *TN*: True Negatives
- *FN*: False Negatives
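As a minimal sketch of the two definitions above, the snippet below computes accuracy once from the four confusion counts and once directly from label lists. The function names and the toy labels are made up for illustration and are not taken from the article's code.

```python
def accuracy_from_counts(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


def accuracy_from_labels(y_true, y_pred):
    """Share of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one prediction is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: 1 = positive, 0 = negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# These labels contain TP=3, TN=3, FP=1, FN=1, so both routes agree.
print(accuracy_from_labels(y_true, y_pred))          # 0.75
print(accuracy_from_counts(tp=3, tn=3, fp=1, fn=1))  # 0.75
```

If you already use scikit-learn, `sklearn.metrics.accuracy_score(y_true, y_pred)` computes the same quantity from label arrays.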

All this is simple and straightforward. However, even this simple metric can be misleading. Let’s see an example.

## The Accuracy Paradox

The default form of accuracy gives an overall metric about model performance on the whole dataset.

However, overall accuracy can be misleading when the class distribution is imbalanced, and it is critical to predict the minority class correctly.

For example, in cancer prediction, we cannot miss malignant cases. Neither should we diagnose benign ones as malignant. Doing so would put healthy people through serious treatment and decrease trust in the whole diagnostic process.

Let’s see an example. We will use the Wisconsin Breast Cancer dataset, which classifies breast tumor cases as benign or malignant.

Before modeling, we make the data imbalanced by removing most malignant cases, so only around 5.6% of tumor cases are malignant.

We also use only a single feature to make our model’s job harder.

Let’s see how well we can predict this situation.

Our model achieved an overall accuracy of ~0.9464. This result seems strikingly good.

However, if we take a look at the class-level predictions, we get a very different picture.

Our model misdiagnosed almost all malignant cases. The result is exactly the opposite of what we expected based on the overall accuracy metric.

This situation is a typical example of the accuracy paradox. You achieve a high accuracy value, but it gives a false impression of model quality because the dataset is highly imbalanced and mispredicting the minority class is costly.
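The effect is easy to reproduce without any real data. The sketch below uses synthetic labels (950 benign, 50 malignant, numbers chosen only for illustration) and a hypothetical "model" that always predicts the majority class. It is not the Wisconsin Breast Cancer experiment described above, just the same mechanism in its simplest form.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Share of actual `positive` samples that were predicted as `positive`."""
    actual = sum(t == positive for t in y_true)
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    # Assumption: recall is reported as 0.0 when the class never occurs.
    return hits / actual if actual else 0.0


# Synthetic, imbalanced labels: 0 = benign, 1 = malignant.
y_true = [0] * 950 + [1] * 50

# A degenerate "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

overall = accuracy(y_true, y_pred)
benign_recall = recall(y_true, y_pred, positive=0)
malignant_recall = recall(y_true, y_pred, positive=1)

print(overall)           # 0.95
print(benign_recall)     # 1.0
print(malignant_recall)  # 0.0
```

The overall score looks excellent even though not a single malignant case is detected, which is why class-level figures belong next to any accuracy number on imbalanced data.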

In such situations, you try to predict rare but critical risks with systemic consequences. Examples are serious medical illnesses, economic crises, terrorist attacks, and meteor strikes.

It does not matter if your model achieves 99.99% accuracy if missing a single case is enough to sabotage the whole system. Relying on accuracy is not enough and can even be misleading.

Fortunately, you can mitigate this issue by considering your specific situation (“Is my data imbalanced?”), the ‘cost’ of misdiagnosing a class, and using other metrics.

We have covered more appropriate metrics in other posts. Here are a few examples:

- Precision: Percentage of correct predictions of a class among all **predictions** for that class.
- Recall: Proportion of correct predictions of a class among all **occurrences** of that class.
- F-score: A single metric that combines precision and recall.
- Confusion matrix: A tabular summary of True/False Positive/Negative prediction rates.
- ROC curve: A binary classification diagnostic plot.
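To make the first three metrics in this list concrete, here is a small sketch that derives precision, recall, and the F1 variant of the F-score from label lists. The function name, the toy labels, and the choice to return 0.0 when a denominator is zero are assumptions made for this illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    # Assumption: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.500 f1=0.571
```

Accuracy on the same labels is 0.7, which hides the fact that half of the positive cases were missed.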

Besides these fundamental classification metrics, you can use a wide range of further measures. This table summarizes a number of them:

Ultimately you need to use a metric that fits your specific situation, business problem, and workflow and that you can effectively communicate to your stakeholders.

This might even mean coming up with your own metric.

We have learned about using accuracy in binary problems. Let’s look at cases where we have to predict multiple classes.

## Accuracy in Multiclass Problems

In a multiclass problem, we can use the same general definition as with the binary one. However, because we cannot rely on True/False binary definitions, we need to express it in a more general form:

$$\text{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} [[y_i = z_i]]$$

Where

- $n$ is the number of samples.
- [[…]] is the Iverson bracket, which returns 1 when the expression within it is true and 0 otherwise.
- $y_i$ and $z_i$ are the true and predicted output labels of the given sample, respectively.

Let’s see an example. The following confusion matrix shows true values and predictions for a 3-class prediction problem.

We calculate accuracy by dividing the number of correct predictions (the corresponding diagonal in the matrix) by the total number of samples.

The result tells us that our model achieved a 44% accuracy on this multiclass problem.

As in the binary case, a single overall accuracy value conceals class-level issues in the multiclass case, so it is important to examine class-level predictions.
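The diagonal-over-total calculation can be sketched in a few lines. The confusion matrix below is hypothetical (it is not the matrix from the example above), with rows as true classes and columns as predicted classes.

```python
def accuracy_from_confusion(matrix):
    """Overall accuracy: sum of the diagonal divided by the sum of all cells.

    `matrix[i][j]` counts samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("the confusion matrix contains no samples")
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Hypothetical 3-class confusion matrix (30 samples in total).
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [1, 3, 6],
]

multiclass_accuracy = accuracy_from_confusion(confusion)
print(f"{multiclass_accuracy:.3f}")  # 0.667
```

Here 20 of the 30 samples sit on the diagonal, so the overall accuracy is 20/30, regardless of how the 10 errors are spread across classes.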

For example, let’s make predictions on the Iris dataset by using the sepal columns.

The overall accuracy is ~76.7%, which might not be that bad.

However, when we examine the results on the class level, the results are more diverse.

Accuracy is hard to interpret for individual classes in a multiclass problem, so we use the class-level recall values instead.

The confusion matrix shows that we correctly predicted all the ‘setosa’ types but had only 75% success with the ‘versicolor’ and 50% with the ‘virginica’ ones.

This example shows the limitations of accuracy in a multiclass setting. We can use other metrics (e.g., precision and recall) and statistical tests to avoid such problems, just like in the binary case. We can also apply averaging techniques (e.g., micro and macro averaging) to provide a more meaningful single-number metric. For an overview of multiclass evaluation metrics, see this overview.
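Class-level recall and its macro average can both be read off a confusion matrix. The sketch below uses a hypothetical, imbalanced 3-class matrix with placeholder class names (not the Iris results above) to show how overall accuracy and macro-averaged recall can diverge.

```python
def per_class_recall(matrix, labels):
    """Recall per class: diagonal cell divided by the row total (true count)."""
    recalls = {}
    for i, label in enumerate(labels):
        support = sum(matrix[i])
        # Assumption: a class with no true samples gets recall 0.0.
        recalls[label] = matrix[i][i] / support if support else 0.0
    return recalls


def macro_average(values):
    """Unweighted mean, so every class counts equally regardless of size."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical matrix: rows are true classes, columns are predicted classes.
labels = ["class_a", "class_b", "class_c"]
confusion = [
    [20, 0, 0],  # 20 samples of class_a, all correct
    [1, 8, 1],   # 10 samples of class_b, 8 correct
    [0, 3, 2],   # 5 samples of class_c, 2 correct
]

recalls = per_class_recall(confusion, labels)
macro_recall = macro_average(recalls.values())
overall_accuracy = sum(confusion[i][i] for i in range(3)) / sum(map(sum, confusion))

print(recalls)                    # {'class_a': 1.0, 'class_b': 0.8, 'class_c': 0.4}
print(f"{macro_recall:.3f}")      # 0.733
print(f"{overall_accuracy:.3f}")  # 0.857
```

Because the largest class is also the easiest one, overall accuracy (0.857) looks better than the macro-averaged recall (0.733), which weights the weak minority class as heavily as the others.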

## Accuracy in Multilabel Problems

Multilabel classification problems differ from multiclass ones in that the classes are not mutually exclusive: a sample can belong to several classes at once. In machine learning, we can represent them as multiple binary classification problems.

Let’s see an example based on the RCV1 dataset. In this problem, we try to predict 103 classes represented as a big sparse matrix of output labels. To simplify our task, we use a 1,000-row sample.

When we compare predictions with test values, the model seems to be accurate.

However, this is not a meaningful result because it relies on the huge number of ‘Negative’ values in the class vectors. We have a similar problem as in the imbalanced binary case. Only now, we have many imbalanced class vectors where the majority classes are the ‘Negative’ values.
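A tiny synthetic example shows how strong this effect is. The sketch below uses made-up indicator rows with 10 labels per sample (not the RCV1 data) and a hypothetical "model" that predicts no labels at all, then scores it cell by cell.

```python
def elementwise_accuracy(Y_true, Y_pred):
    """Naive multilabel accuracy: share of matching cells in the label matrix."""
    cells = 0
    correct = 0
    for true_row, pred_row in zip(Y_true, Y_pred):
        for t, p in zip(true_row, pred_row):
            cells += 1
            correct += int(t == p)
    return correct / cells


# Made-up sparse label matrix: 4 samples x 10 labels, only 4 active labels.
Y_true = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]

# A "model" that never predicts any label.
Y_pred = [[0] * 10 for _ in Y_true]

naive_accuracy = elementwise_accuracy(Y_true, Y_pred)
print(naive_accuracy)  # 0.9
```

The model misses every active label and still scores 0.9, because 36 of the 40 cells are 'Negative' in both matrices.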

To better understand our model accuracy, we need to use different ways to calculate it.

## Subset Accuracy or Exact Match Ratio

Subset Accuracy (also called Exact Match Ratio or Labelset Accuracy) is a strict version of the accuracy metric where a “correct” prediction requires all the labels to match for a given sample.

$$\text{Subset Accuracy} = \frac{1}{n} \sum_{i=1}^{n} [[Y_i = Z_i]]$$

Where

- $n$ is the number of samples.
- [[…]] is the Iverson bracket, which returns 1 when the expression within it is true and 0 otherwise.
- $Y_i$ and $Z_i$ are the given sample’s true and predicted output label sets, respectively. (Please note that we compare full label sets here, not single labels.)

Because we work with a relatively large number of labels, correctly predicting all of them is very hard. Not surprisingly, Subset Accuracy shows a very low performance for our model.

This metric does not give information about partial correctness because of the strict criterion it relies on. If our model fails to predict only a single label from the 103 but performs well on the rest, Subset Accuracy still categorizes these predictions as failures.
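As a minimal sketch, Subset Accuracy can be computed by comparing whole label sets per sample. The representation as Python sets, the function name, and the toy labels are assumptions made for this illustration.

```python
def subset_accuracy(Y_true, Y_pred):
    """Share of samples whose predicted label set equals the true label set."""
    if len(Y_true) != len(Y_pred):
        raise ValueError("Y_true and Y_pred must have the same length")
    if len(Y_true) == 0:
        raise ValueError("at least one sample is required")
    matches = sum(set(t) == set(p) for t, p in zip(Y_true, Y_pred))
    return matches / len(Y_true)


# Made-up label sets for 4 samples.
Y_true = [{"a", "b"}, {"c"}, {"a", "c"}, set()]
Y_pred = [{"a", "b"}, {"c", "d"}, {"a"}, set()]

exact_match = subset_accuracy(Y_true, Y_pred)
print(exact_match)  # 0.5
```

Samples 2 and 3 are each off by a single label but count as complete failures, which is exactly the strictness described above. In scikit-learn, `accuracy_score` computes subset accuracy when it is given multilabel indicator matrices.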

To balance this, we can use other metrics that reflect more partial correctness.

## Multilabel Accuracy or Hamming Score

In multilabel settings, Accuracy (also called Hamming Score) is the ratio of correctly predicted labels to the number of active labels (both true and predicted) for each sample, averaged over all samples.

$$\text{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}$$

Where

- $n$ is the number of samples.
- $Y_i$ and $Z_i$ are the given sample’s true and predicted output label sets, respectively.

Multilabel Accuracy gives a more balanced metric because it does not rely on the ‘exact match’ criterion (like Subset Accuracy). Nor does it count ‘True Negative’ values as ‘correct’ (as in our naive case).
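The intersection-over-union idea can be sketched with Python sets. The function name, the toy labels, and the convention of scoring a sample as 1.0 when both label sets are empty are assumptions of this illustration; other implementations may treat the empty case differently.

```python
def hamming_score(Y_true, Y_pred):
    """Mean over samples of |true ∩ predicted| / |true ∪ predicted|."""
    if len(Y_true) != len(Y_pred):
        raise ValueError("Y_true and Y_pred must have the same length")
    if len(Y_true) == 0:
        raise ValueError("at least one sample is required")
    scores = []
    for t, p in zip(Y_true, Y_pred):
        t, p = set(t), set(p)
        union = t | p
        # Assumption: two empty label sets count as a perfect match.
        scores.append(len(t & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Made-up label sets for 4 samples.
Y_true = [{"a", "b"}, {"c"}, {"a", "c"}, set()]
Y_pred = [{"a", "b"}, {"c", "d"}, {"a"}, set()]

score = hamming_score(Y_true, Y_pred)
print(score)  # 0.75
```

The per-sample scores are 1.0, 0.5, 0.5, and 1.0, so partially correct predictions earn partial credit instead of being counted as failures.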

Subset Accuracy and Multilabel Accuracy are not the only metrics for multilabel problems and are not even the most widely used ones. For example, Hamming Loss is more appropriate in many cases.

## Hamming Loss

Hamming loss is the fraction of wrongly predicted labels out of the total number of labels. It can take values between 0 and 1, where 0 represents the ideal scenario of no errors.

$$\text{Hamming Loss} = \frac{1}{nk} \sum_{i=1}^{n} |Y_i \,\Delta\, Z_i|$$

Where

- $n$ is the number of samples.
- $k$ is the number of labels.
- $Y_i$ and $Z_i$ are the given sample’s true and predicted output label sets, respectively.
- $\Delta$ is the symmetric difference.

The main reason behind its popularity is its simplicity.
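Assuming label sets are stored as Python sets, the definition fits in a few lines. The function name, the `n_labels` parameter, and the toy labels are chosen for this sketch only.

```python
def hamming_loss(Y_true, Y_pred, n_labels):
    """Wrongly predicted labels divided by (number of samples * number of labels).

    A label is wrong for a sample when it is in exactly one of the two sets,
    i.e. when it belongs to their symmetric difference.
    """
    if len(Y_true) != len(Y_pred):
        raise ValueError("Y_true and Y_pred must have the same length")
    if len(Y_true) == 0 or n_labels <= 0:
        raise ValueError("need at least one sample and one label")
    wrong = sum(len(set(t) ^ set(p)) for t, p in zip(Y_true, Y_pred))
    return wrong / (len(Y_true) * n_labels)


# Made-up label sets for 4 samples over the 4 possible labels a, b, c, d.
Y_true = [{"a", "b"}, {"c"}, {"a", "c"}, set()]
Y_pred = [{"a", "b"}, {"c", "d"}, {"a"}, set()]

loss = hamming_loss(Y_true, Y_pred, n_labels=4)
print(loss)  # 0.125
```

Two of the 16 sample-label pairs are wrong, giving 2/16. scikit-learn provides `sklearn.metrics.hamming_loss` for the same calculation on indicator matrices.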

Besides these measurements, you can use the multilabel version of the same classification metrics you have seen in the binary and multiclass case (e.g., precision, recall, F-score). You can also apply averaging techniques (micro, macro, and sample-based) or ranking-based metrics.

For an overview of multilabel metrics, see this review article or this book on the topic.

## Further Accuracy Types

We have reviewed the most important cases to measure accuracy in binary, multiclass, and multilabel problems. However, there are additional variations of accuracy which you may be able to use for your specific problem.

Here are the most widely used examples:

## Be Sure How to Measure the Accuracy of Your Machine Learning Model

So, what machine learning model validation method will you use? Whatever metric you choose, you should know what it is good for, its caveats, and what processes you can use to validate against its common pitfalls.

The bigger your machine learning projects are, the more complex the system of metrics you need to monitor. You have to learn about them, know how to implement them, and keep them in check continuously. These tasks can become hard to maintain and can introduce wrong metrics, wrong measurements, and wrong interpretations.

One way to solve this is to use machine learning validation solutions like Deepchecks, Evidently AI, or Neptune, which provide a broad range of tried and tested metrics with worked-out implementations and detailed documentation.

Using Deepchecks, you can choose from a wide range of verified and documented metrics so you can better understand the workings of your machine learning models and trust them more.

Are you interested in how? Arrange a demo with us, and we will show it to you!