How to Check the Accuracy of Your Machine Learning Model


Accuracy is perhaps the best-known Machine Learning model validation method used in evaluating classification problems. One reason for its popularity is its relative simplicity. It is easy to understand and easy to implement. Accuracy is a good metric to assess model performance in simple cases.

However, in real-life scenarios, modeling problems are rarely simple. You may need to work with imbalanced datasets or multiclass or multilabel classification problems. Sometimes, a high accuracy might not even be your goal. As you solve more complex ML problems, calculating and using accuracy becomes less obvious and requires extra consideration.

For this reason, it is important to understand what accuracy is, how to calculate it, and what its weaknesses are in different machine learning contexts.

This article gives you an overview of accuracy as a classification metric. It explains its definition, shows you how to use it in a binary, multiclass, and multilabel context, and identifies its main issues.

You can find the full code behind the examples here.


Accuracy is an evaluation metric for classification problems that measures the share of correct predictions made by a model. We calculate it by dividing the number of correct predictions by the total number of predictions:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

This formula provides an easy-to-understand definition that assumes a binary classification problem. We discuss multiclass and multilabel classification problems in the second part of this article.
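The definition can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's original code; the function name `accuracy` and the label lists are hypothetical. (scikit-learn's `accuracy_score` computes the same quantity.)

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

score = accuracy(y_true, y_pred)  # 6 of the 8 predictions are correct
print(score)
```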

The Accuracy Paradox

The default form of accuracy gives an overall metric about model performance on the whole dataset.

However, overall accuracy in machine learning classification models can be misleading when the class distribution is imbalanced, and it is critical to predict the minority class correctly. In this case, the class with a higher occurrence may be correctly predicted, leading to a high accuracy score, while the minority class is being misclassified. This gives the wrong impression that the model is performing well when it is not.

For example, in cancer prediction, we cannot miss malignant cases. Neither should we diagnose benign ones as malignant. Doing so would put healthy people through serious treatment and decrease trust in the whole diagnostic process. In most cases, however, the dataset contains many data points in the benign class and only a few in the malignant class.

Let’s see an example. We will use the Wisconsin Breast Cancer dataset, which classifies breast tumor cases as benign or malignant.

Before modeling, we make the data imbalanced by removing most malignant cases, so only around 5.6% of tumor cases are malignant.

We also use only a single feature to make our model’s job harder.

Let’s see how well we can predict in this situation.

Our model achieved an overall accuracy of ~0.9464 on the whole dataset. This result seems to be strikingly good.

However, if we take a look at the class-level predictions using a confusion matrix, we get a very different picture.

Our model misdiagnosed almost all malignant cases. The result is exactly the opposite of what we expected based on the overall accuracy metric.

The situation is a typical example of the accuracy paradox. You achieve a high accuracy value, but it gives a false impression of model quality because the dataset is highly imbalanced and mispredicting the minority class is costly.

In such situations, you try to predict rare but critical risks with systemic consequences. Examples are serious medical illnesses, economic crises, terrorist attacks, and meteor strikes.

It does not matter if your model achieves 99.99% accuracy if missing a single case is enough to sabotage the whole system. Relying on the accuracy score as calculated above is not enough and can even be misleading.
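The paradox can be reproduced without any trained model. The sketch below is not the article's experiment: it assumes a hypothetical test set of 1,000 tumors with 5.6% malignant cases and a trivial "model" that always predicts the majority class.

```python
# Hypothetical imbalanced test set: 1000 tumors, 56 of them malignant (5.6%)
n_total = 1000
n_malignant = 56
y_true = [1] * n_malignant + [0] * (n_total - n_malignant)  # 1 = malignant

# Trivial baseline that always predicts the majority class (0 = benign)
y_pred = [0] * n_total

# Overall accuracy: share of matching labels
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total

# Recall on the malignant class: detected malignant cases / all malignant cases
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
malignant_recall = true_positives / n_malignant

print(f"accuracy: {accuracy:.3f}")
print(f"malignant recall: {malignant_recall:.3f}")
```

The baseline scores high on accuracy while detecting none of the malignant cases, which is why a class-level check is needed alongside the overall number.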

Fortunately, you can mitigate this issue by considering your specific situation and asking questions like:

  • Is my data imbalanced?
  • What is the ‘cost’ of misdiagnosing a class?

If accuracy is not a suitable metric for evaluating your Machine Learning model performance, we have covered more appropriate metrics in other posts. Here are a few examples:

  • Precision: Percentage of correct predictions of a class among all predictions for that class.
  • Recall: Ratio of correct predictions of a class to the total number of occurrences of that class.
  • F-score: A single metric that combines precision and recall.
  • Confusion matrix: A tabular summary of True/False Positive/Negative prediction counts.
  • ROC curve: A binary classification diagnostic plot.
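As a rough illustration of the first three metrics in the list, the following sketch computes precision, recall, and the F1 variant of the F-score from confusion counts. The helper name and the counts are hypothetical, and returning 0.0 when a denominator is zero is an assumption made here for simplicity, not a universal convention.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for one class from its confusion counts."""
    # Assumption: undefined ratios (zero denominator) are reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts for the positive class
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```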

Besides these fundamental classification metrics, you can use a wide range of further measures. This table summarizes a number of them:

Ultimately, you need to use a metric that fits your specific situation, business problem, and workflow, and that you can effectively communicate to your stakeholders.

This might even mean coming up with your own metric.

So far, we have treated accuracy as a single overall score. The following sections look at how to calculate it in binary, multiclass, and multilabel problems.

Accuracy in Binary Classification

In the binary classification case, we can express accuracy in terms of True/False Positive/Negative counts. The accuracy formula is:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Here there are only two classes, positive and negative:

  • TP: True Positives, i.e., positive samples that are correctly predicted as positive.
  • FP: False Positives, i.e., negative samples that are falsely predicted as positive.
  • TN: True Negatives, i.e., negative samples that are correctly predicted as negative.
  • FN: False Negatives, i.e., positive samples that are falsely predicted as negative.
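A minimal sketch of the four counts and the resulting accuracy, assuming labels encoded as 1 (positive) and 0 (negative). The function names and the sample labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary problem."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if t == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


def binary_accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return (tp + tn) / total


# Hypothetical labels
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
acc = binary_accuracy(*counts)
print(counts, acc)
```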

All this is simple and straightforward. However, even this simple metric can be misleading, as the accuracy paradox described above shows.

Accuracy in Multiclass Problems

In a multiclass problem, we can use the same general definition as in the binary case. However, because we cannot rely on True/False binary definitions, we need to express it in a more general form:

$$\text{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} [[y_i = z_i]]$$

  • n is the number of samples.
  • [[…]] is the Iverson bracket which returns 1 when the expression within it is true and 0 otherwise.
  • y_i and z_i are the true and predicted output labels of the given sample, respectively.

Let’s see an example. The following confusion matrix shows true values and predictions for a 3-class prediction problem.

We calculate accuracy by dividing the number of correct predictions (the corresponding diagonal in the matrix) by the total number of samples.

The result tells us that our model achieved a 44% accuracy on this multiclass problem.
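The calculation can be sketched as follows. The confusion matrix below is hypothetical: its counts are made up so that the diagonal adds up to 44% of the samples, and it is not the matrix from the original example.

```python
# Hypothetical 3-class confusion matrix
# (rows = true class, columns = predicted class)
confusion = [
    [4, 3, 2],
    [2, 3, 3],
    [2, 2, 4],
]

# Correct predictions sit on the main diagonal
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)

accuracy = correct / total
print(f"{correct} correct out of {total}: accuracy = {accuracy:.2f}")
```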

However, an overall accuracy metric conceals class-level issues in the multiclass case as well, so it is important to examine class-level predictions.

For example, let’s make predictions on the Iris dataset by using the sepal columns.

The overall accuracy is ~76.7%, which might not be that bad.

However, when we examine the results at the class level, the picture is more varied.

Accuracy is hard to interpret for individual classes in a multiclass problem, so we use the class-level recall values instead.

The confusion matrix shows that we correctly predicted all the ‘setosa’ types but had only 75% success with the ‘versicolor’ and 50% with the ‘virginica’ ones.
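Class-level recall can be read off a confusion matrix row by row. The sketch below uses hypothetical counts chosen only to be consistent with the percentages quoted in the text (the off-diagonal cells in particular are illustrative); it is not the article's actual output.

```python
# Hypothetical confusion matrix (rows = true class, columns = predicted class)
labels = ["setosa", "versicolor", "virginica"]
confusion = [
    [10, 0, 0],
    [0, 9, 3],
    [0, 4, 4],
]


def per_class_recall(matrix):
    """Recall of class i = correct predictions in row i / size of row i."""
    recalls = []
    for i, row in enumerate(matrix):
        support = sum(row)
        recalls.append(row[i] / support if support else 0.0)
    return recalls


recalls = per_class_recall(confusion)
overall = sum(confusion[i][i] for i in range(len(labels))) / sum(map(sum, confusion))

for name, r in zip(labels, recalls):
    print(f"{name}: recall = {r:.2f}")
print(f"overall accuracy = {overall:.3f}")
```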

This example shows the limitations of accuracy in machine learning multiclass classification problems. We can use other metrics (e.g., precision, recall, log loss) and statistical tests to avoid such problems, just like in the binary case. We can also apply averaging techniques (e.g., micro and macro averaging) to provide a more meaningful single-number metric. For an overview of multiclass evaluation metrics, see this overview.
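To illustrate the averaging idea, the sketch below compares overall accuracy with macro-averaged recall on a hypothetical, imbalanced 3-class confusion matrix. In single-label multiclass problems, micro-averaged recall is the same number as overall accuracy, while macro-averaged recall (often called balanced accuracy) weights every class equally. The counts are made up for illustration.

```python
# Hypothetical imbalanced confusion matrix
# (rows = true class, columns = predicted class)
confusion = [
    [90, 0, 0],  # majority class: always predicted correctly
    [4, 1, 0],   # minority class: mostly confused with the majority class
    [3, 0, 2],   # minority class
]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n_classes))

# Micro view: pool all samples, identical to overall accuracy here
overall_accuracy = correct / total

# Macro view: compute recall per class, then take the unweighted mean
recalls = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]
macro_recall = sum(recalls) / n_classes

print(f"overall accuracy: {overall_accuracy:.3f}")
print(f"macro-averaged recall: {macro_recall:.3f}")
```

The gap between the two numbers is a quick signal that the minority classes are being predicted poorly.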

Accuracy in Multilabel Problems

Multilabel classification problems differ from multiclass ones in that the classes are not mutually exclusive: a sample can belong to several classes at once. In ML, we can represent them as multiple binary classification problems.

Let’s see an example based on the RCV1 data set. In this problem, we try to predict 103 classes represented as a big sparse matrix of output labels. To simplify our task, we use a 1000-row sample.

When we compare predictions with test values, the model seems to be accurate.

However, this is not a meaningful result because it relies on the huge number of ‘Negative’ values in the class vectors. We have a similar problem as in the imbalanced binary case. Only now, we have many imbalanced class vectors where the majority classes are the ‘Negative’ values.
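The effect can be shown on a tiny example. The label matrix below is hypothetical (4 samples, 10 labels, one active label each), and the "model" simply predicts that no label is ever active; it stands in for the naive element-wise comparison, not for the article's actual model.

```python
n_labels = 10

# Hypothetical sparse label matrix: one active label per sample
y_true = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
]

# Trivial "model": never predicts any label
y_pred = [[0] * n_labels for _ in y_true]

# Naive accuracy: compare every cell of the two matrices
matches = sum(
    t == p
    for true_row, pred_row in zip(y_true, y_pred)
    for t, p in zip(true_row, pred_row)
)
naive_accuracy = matches / (len(y_true) * n_labels)

# How many of the active labels were actually found?
active_found = sum(
    t == 1 and p == 1
    for true_row, pred_row in zip(y_true, y_pred)
    for t, p in zip(true_row, pred_row)
)

print(f"naive accuracy: {naive_accuracy:.2f}")
print(f"active labels found: {active_found}")
```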

To better understand our model’s accuracy, we need to use different ways to calculate it.

Multilabel Accuracy or Hamming Score

In multilabel settings, Accuracy (also called Hamming Score) is the ratio of correctly predicted labels to the number of active labels (true or predicted), averaged over all samples:

$$\text{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}$$

  • n is the number of samples.
  • Y_i and Z_i are the given sample’s true and predicted output label sets, respectively.

Multilabel Accuracy gives a more balanced metric because it does not rely on the ‘exact match’ criterion (like Subset Accuracy). Nor does it count ‘True Negative’ values as ‘correct’ (as in our naive case).

The closer the Hamming score is to one, the better the performance of the model.
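A minimal sketch of the Hamming score, representing each sample's labels as a Python set. The function name and the label sets are hypothetical, and treating a sample whose true and predicted sets are both empty as a perfect match is an assumption made here to avoid a division by zero.

```python
def hamming_score(true_sets, pred_sets):
    """Mean of |Y ∩ Z| / |Y ∪ Z| over all samples."""
    if len(true_sets) != len(pred_sets) or not true_sets:
        raise ValueError("need two non-empty sequences of equal length")
    scores = []
    for y, z in zip(true_sets, pred_sets):
        union = y | z
        if not union:
            scores.append(1.0)  # assumption: nothing to predict, nothing predicted
        else:
            scores.append(len(y & z) / len(union))
    return sum(scores) / len(scores)


# Hypothetical label sets for three samples
true_sets = [{"a", "b"}, {"c"}, {"a", "c"}]
pred_sets = [{"a"}, {"c"}, {"b", "c"}]

score = hamming_score(true_sets, pred_sets)  # mean of 1/2, 1/1, and 1/3
print(f"Hamming score: {score:.3f}")
```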

Subset Accuracy and Multilabel Accuracy are not the only metrics for multilabel problems and are not even the most widely used ones. For example, Hamming Loss is more appropriate in many cases.

Hamming Loss

Hamming loss is the fraction of labels that are predicted incorrectly. It takes values between 0 and 1, where 0 represents the ideal scenario of no errors.

$$\text{Hamming Loss} = \frac{1}{nk} \sum_{i=1}^{n} |Y_i \,\Delta\, Z_i|$$

  • n is the number of samples.
  • k is the number of labels.
  • Y_i and Z_i are the given sample’s true and predicted output label sets, respectively.
  • Δ is the symmetric difference.

The main reason behind its popularity is its simplicity.

The closer the Hamming loss is to zero, the better the performance of the model.
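A minimal sketch of Hamming loss using Python sets, where `^` is the symmetric difference operator. The function name, the label sets, and the label count `n_labels` are hypothetical.

```python
def hamming_loss(true_sets, pred_sets, n_labels):
    """Wrongly predicted labels divided by (samples * labels)."""
    if len(true_sets) != len(pred_sets) or not true_sets:
        raise ValueError("need two non-empty sequences of equal length")
    if n_labels <= 0:
        raise ValueError("n_labels must be positive")
    # y ^ z holds the labels that are in exactly one of the two sets,
    # i.e. the missed labels plus the wrongly added ones
    wrong = sum(len(y ^ z) for y, z in zip(true_sets, pred_sets))
    return wrong / (len(true_sets) * n_labels)


# Hypothetical label sets for three samples over the labels {a, b, c}
true_sets = [{"a", "b"}, {"c"}, {"a", "c"}]
pred_sets = [{"a"}, {"c"}, {"b", "c"}]

loss = hamming_loss(true_sets, pred_sets, n_labels=3)  # (1 + 0 + 2) / 9
print(f"Hamming loss: {loss:.3f}")
```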

Besides these measurements, you can use the multilabel version of the same classification metrics you have seen in the binary and multiclass case (e.g., precision, recall, F-score). You can also apply averaging techniques (micro, macro, and sample-based) or ranking-based metrics.

For an overview of multilabel metrics, see this review article or this book on the topic.

Subset Accuracy or Exact Match Ratio

Subset Accuracy (also called Exact Match Ratio or Labelset Accuracy) is a strict version of the accuracy metric where a “correct” prediction requires all the labels to match for a given sample.

$$\text{Subset Accuracy} = \frac{1}{n} \sum_{i=1}^{n} [[Y_i = Z_i]]$$

  • n is the number of samples.
  • [[…]] is the Iverson bracket which returns 1 when the expression within it is true and 0 otherwise.
  • Y_i and Z_i are the given sample’s true and predicted output label sets, respectively. (Please note that we compare full label sets here, not single labels.)

Because we work with a relatively large number of labels, correctly predicting all of them is very hard. Not surprisingly, Subset Accuracy shows very low performance for our model.

This metric does not give information about partial correctness because of the strict criterion it relies on. If our model fails to predict only a single label from the 103 but performs well on the rest, Subset Accuracy still categorizes these predictions as failures.
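The strictness of the criterion is easy to see in a short sketch. The function name and label sets below are hypothetical; a sample counts as correct only when its predicted label set equals the true one exactly.

```python
def subset_accuracy(true_sets, pred_sets):
    """Share of samples whose predicted label set matches exactly."""
    if len(true_sets) != len(pred_sets) or not true_sets:
        raise ValueError("need two non-empty sequences of equal length")
    exact = sum(y == z for y, z in zip(true_sets, pred_sets))
    return exact / len(true_sets)


# Hypothetical label sets for three samples
true_sets = [{"a", "b"}, {"c"}, {"a", "c"}]
pred_sets = [{"a"}, {"c"}, {"b", "c"}]

# Only the second sample matches exactly; the first one gets no credit
# even though one of its two labels was predicted correctly
score = subset_accuracy(true_sets, pred_sets)
print(f"Subset accuracy: {score:.3f}")
```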

To balance this, we can use other metrics that reflect more partial correctness.

Further Accuracy Types

We have reviewed the most important cases to measure accuracy in binary, multiclass, and multilabel problems. However, there are additional variations of accuracy which you may be able to use for your specific problem.

Here are the most widely used examples:

When to use Accuracy Score in Machine Learning

Accuracy score should be used when you want to know the skill of a model to classify data points correctly, irrespective of the prediction performance per class or label. It gives you an intuition for whether the data you have is suitable for your classification problem.

If you need to use the accuracy metric in your project, there are easy-to-use packages such as Deepchecks that give you in-depth reports on relevant metrics to evaluate your model. This makes it easier to understand your model’s performance.

Be Sure How to Measure the Accuracy of Your Machine Learning Model

Whatever metric you choose, you should know what it is good for, its caveats, and what processes you can use to validate against its common pitfalls.

The bigger the ML projects you have, the more complex the system of metrics you need to monitor. You have to learn about them, know how to implement them, and keep them in check continuously. These tasks can become hard to maintain and tend to introduce wrong metrics, wrong measurements, and wrong interpretations.

One way to make model evaluation, validation, and monitoring easier is to use ML solutions like Deepchecks at the different stages of the ML lifecycle. Deepchecks provides a broad range of tried and tested metrics with worked-out implementations and detailed documentation.

Using Deepchecks, you can choose from a wide range of verified and documented metrics so you can better understand the workings of your Machine Learning models and trust them more.

Are you interested in how? Get started.
