A Guide to Evaluation Metrics for Classification Models


Suppose you are trying to detect a rare disease based on a patient’s speech with a sophisticated ML model. You train your model and achieve 99% accuracy. You have saved humanity!

Hold on a minute. If more than 99% of patients do not have this disease, your model can simply predict the label “False” for any input and achieve a very high level of accuracy, but surely this is not what we intended…

In this blog post, we focus on different evaluation metrics for classification models. These metrics can help you assess your model’s performance, monitor your ML system in production, and control your model to fit your business needs.

It is worth noting that many other metrics are relevant for tasks such as regression, computer vision, and NLP, and they are worth checking out as well.


Accuracy

Accuracy gives us an overall picture of how much we can rely on our model’s predictions. This metric is blind to the difference between classes and types of errors, so for imbalanced datasets, accuracy alone is generally not enough.

import numpy as np
from sklearn.metrics import accuracy_score
y_pred = np.array([0] * 1000)
y = np.array([1] * 8 + [0] * 992)
print(accuracy_score(y, y_pred))

Result: 0.992

In this code snippet, we defined an imbalanced dataset where over 99% of the examples have the label “0.” Our baseline model simply outputs “0” irrespective of the input. As we can see, this model achieves an accuracy score of 99.2%.

Confusion Matrix

One way to ensure we are not blinded by the overall accuracy is to evaluate our model quality on each class independently. A popular way to visualize this is by using the Confusion Matrix. Plotting this matrix for multiclass classification will also help us identify the most common mistakes, hence the name “confusion matrix.”

Confusion Matrix for iris species classification. The most common error is predicting virginica instead of versicolor (Source).

In the Confusion Matrix, the elements on the main diagonal represent the correct predictions, while the elements elsewhere represent errors.
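To make the diagonal/off-diagonal reading concrete, here is a minimal sketch on a small three-class problem. The labels are hypothetical, made up purely for illustration, and are separate from the running example in this post.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical three-class labels, made up purely for illustration
y_true_mc = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred_mc = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])

# scikit-learn convention: rows are true labels, columns are predicted labels
cm_mc = confusion_matrix(y_true_mc, y_pred_mc)
print(cm_mc)

# Correct predictions sit on the main diagonal
correct = np.trace(cm_mc)
print("correct predictions:", correct, "out of", cm_mc.sum())

# Zero out the diagonal so only the errors remain, then find the largest one
errors = cm_mc.copy()
np.fill_diagonal(errors, 0)
true_cls, pred_cls = np.unravel_index(errors.argmax(), errors.shape)
print(f"most common error: true class {true_cls} predicted as class {pred_cls}")
```

In this made-up data the largest off-diagonal cell is the one where class 1 is predicted as class 2, which is exactly the kind of “confusion” the matrix is designed to surface.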

Running the following code on our example will plot the appropriate Confusion Matrix:

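import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_cm(data):
    # Draw the matrix as a heatmap and annotate each cell with its value
    fig, ax = plt.subplots()
    ax.matshow(data, cmap='Reds')
    for (i, j), z in np.ndenumerate(data):
        ax.text(j, i, '{:0.1f}'.format(z), ha='center', va='center')

# Raw counts: rows are true labels, columns are predicted labels
plot_cm(confusion_matrix(y, y_pred))
# Normalized so that each true-label row sums to 1
plot_cm(confusion_matrix(y, y_pred, normalize="true"))
plt.show()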

Image: The Confusion Matrix for our example (default and normalized). The most common error is the “false negative.”

False Positives and False Negatives

A False Positive is an error in binary classification wherein a test result incorrectly indicates the presence of a condition such as a disease when the disease is not present. False Negative is the opposite, where the test result incorrectly fails to indicate the presence of a condition when it is present.

  • “False positives and false negatives,” Wikipedia

Following the example from Wikipedia, a true positive would be a correct prediction of the presence of a disease, while a true negative is a correct prediction of its absence.

It is common to define the tolerance for each type of error independently since each error type has a different effect on reality. For example, failing to diagnose a patient with cancer can lead to late detection of the illness, thus having fatal consequences. On the other hand, falsely diagnosing a patient with cancer might cause the patient stress and anxiety, but there usually won’t be a long-term effect.

Defining Positive and Negative

While binary classification can be a relatively symmetric task, there is usually a natural way to assign the labels that mirrors the typical setting of identifying a medical condition in a patient: the “positive” class is the condition or event we are trying to detect. Another way to think about these errors is whether we are sounding a false alarm (false positive) or failing to sound an alarm (false negative).

Relation to the Confusion Matrix

For binary classification, the categories in the Confusion Matrix correspond directly to the four categories we’ve discussed: TP (true positive), TN (true negative), FP (false positive) and FN (false negative).
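As a minimal sketch of this correspondence, the four counts can be read directly off scikit-learn’s `confusion_matrix`. The labels below are hypothetical and chosen only so that every category is non-empty.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical binary labels (1 = positive), made up for illustration
y_true_bin = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred_bin = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# For labels ordered [0, 1], scikit-learn lays the matrix out as
# [[TN, FP],
#  [FN, TP]]
# so ravel() yields the four counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_bin, y_pred_bin).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```

Note that this layout puts the negative class first, so the top-left cell is TN rather than TP. Diagrams elsewhere may use the opposite ordering.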

Visualization of classification categories through the Confusion Matrix (Source).

Precision and Recall

Using our new terminology, we can redefine accuracy as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is a class-agnostic metric that does not give us much information about the distribution of false positives and false negatives.

Precision and Recall are two popular metrics that do contribute to our understanding of the types of errors we have. Precision, or positive predictive value, gives us a measure of how much we can trust a positive prediction of our model:

Precision = TP / (TP + FP)

Recall / Sensitivity / True Positive Rate (TPR) gives us a measure of what fraction of the actual positive examples we detected:

Recall = TP / (TP + FN)

When we want to keep false positives to a minimum, we aim to increase the precision of our model, and when we want to reduce false negatives, we aim to increase the recall.

Visualization of precision and recall (Source)

While in our original example we had 99.2% accuracy, we have undefined precision (division by 0) and 0% recall. These metrics would be useful in detecting that our model is missing something important.
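The numbers above can be checked with scikit-learn. The sketch below rebuilds the same toy data; passing `zero_division=0` is our choice here, and it makes `precision_score` report 0 for the undefined 0/0 case instead of emitting a warning.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Same toy data as in the accuracy example: 8 positives, 992 negatives,
# and a baseline that always predicts 0
y = np.array([1] * 8 + [0] * 992)
y_pred = np.array([0] * 1000)

acc = accuracy_score(y, y_pred)
# There are no positive predictions, so precision is 0/0;
# zero_division=0 reports it as 0.0
prec = precision_score(y, y_pred, zero_division=0)
# None of the 8 positives were detected, so recall is 0/8
rec = recall_score(y, y_pred)
print(acc, prec, rec)
```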

Returning to the Confusion Matrix (with rows as true labels and columns as predicted labels, which is the layout scikit-learn uses), note that precision for a specific class can be obtained by taking the appropriate value on the diagonal and normalizing it by the sum of the values in its column. Similarly, recall is the result of normalizing a diagonal value by the sum of its row. Accuracy, therefore, is the result of normalizing the sum of the diagonal values by the sum of all matrix elements.
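The row and column normalizations described above can be sketched in a few lines of NumPy. The three-class labels are hypothetical, and the result is compared against scikit-learn’s per-class scores as a sanity check.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical three-class labels, made up purely for illustration
y_true_mc = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred_mc = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])

cm_mc = confusion_matrix(y_true_mc, y_pred_mc)  # rows: true, columns: predicted
diag = np.diag(cm_mc)

precision_per_class = diag / cm_mc.sum(axis=0)  # diagonal / column sums
recall_per_class = diag / cm_mc.sum(axis=1)     # diagonal / row sums
accuracy = diag.sum() / cm_mc.sum()             # diagonal sum / all elements

print("precision:", precision_per_class)
print("recall:   ", recall_per_class)
print("accuracy: ", accuracy)

# Sanity check against scikit-learn's per-class scores
print(precision_score(y_true_mc, y_pred_mc, average=None))
print(recall_score(y_true_mc, y_pred_mc, average=None))
```

This sketch assumes every class is predicted at least once. A class that is never predicted has a zero column sum, and its precision would be undefined.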

F1 Score

Usually, there is a tradeoff between achieving high precision and high recall. A common metric that balances the two is the F1 Score, defined as the harmonic mean of precision and recall:

F1 = 2 * Precision * Recall / (Precision + Recall)
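As a small illustration on hypothetical labels, the harmonic mean can be computed by hand and compared with scikit-learn’s `f1_score`.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical binary labels: TP=2, FN=2, FP=1, TN=5
y_true_bin = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_bin = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

p = precision_score(y_true_bin, y_pred_bin)  # 2 / (2 + 1)
r = recall_score(y_true_bin, y_pred_bin)     # 2 / (2 + 2)

# Harmonic mean of precision and recall
f1_manual = 2 * p * r / (p + r)
print(f1_manual, f1_score(y_true_bin, y_pred_bin))

# For comparison, the arithmetic mean is higher: the harmonic mean is
# pulled toward the smaller of the two values
print((p + r) / 2)
```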

ROC Curve

To visualize the tradeoff between false positives and false negatives for a given model, we can plot the Receiver Operating Characteristic (ROC) Curve. This essentially plots the different possible values of TPR and FPR (False Positive Rate, FP / (FP + TN)) that are obtained by using different decision thresholds (the threshold for deciding whether a prediction is labeled “true” or “false”) for our predictive model.

For example, if we have a model that predicts whether a financial transaction is fraudulent, and it outputs the scores (0.1, 0.7, 0.95, 0.5) for four transactions, we will obtain the following binary predictions:

Threshold=0.2: (0, 1, 1, 1)
Threshold=0.8: (0, 0, 1, 0)

The decision threshold is used as a knob to control the tradeoff between our desire for high TPR and low FPR. Increasing the threshold will generally result in an increase in the precision, but a decrease in recall.
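The thresholding step above can be sketched as follows. The helper name `apply_threshold` is hypothetical, and treating a score exactly equal to the threshold as positive is an assumption (it does not matter for these particular values).

```python
import numpy as np

def apply_threshold(scores, threshold):
    """Turn scores into 0/1 predictions; scores >= threshold become 1."""
    return (np.asarray(scores) >= threshold).astype(int)

# The scores from the fraud example above
scores = np.array([0.1, 0.7, 0.95, 0.5])

for threshold in (0.2, 0.8):
    print(f"Threshold={threshold}:", apply_threshold(scores, threshold))
```

Raising the threshold from 0.2 to 0.8 turns two of the three positive predictions into negatives, which is the knob the ROC curve sweeps over.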

A ROC curve that bows further toward the top-left corner represents a better model. The point (0,1) represents a “perfect classifier.”

Below is a code implementation for plotting an ROC Curve that matches our toy example. We can also plot the matching thresholds for each point on the curve (image on the right). Note that we need probabilistic predictions to plot such a curve since the different points in the graph are generated by changing the decision threshold. We attempt to generate a probabilistic prediction of an “okay” classifier.

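import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
# Simulated scores: positives centered at 0.8, negatives at 0.2, clipped to [0, 1]
preds_for_label_true = np.random.normal(0.8, 0.5, 8).clip(0, 1)
preds_for_label_false = np.random.normal(0.2, 0.5, 992).clip(0, 1)
# Order matches y: the 8 positive examples first, then the 992 negatives
y_score = np.append(preds_for_label_true, preds_for_label_false)
fpr, tpr, thresholds = roc_curve(y, y_score=y_score)
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.show()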

Precision Recall Curve

Another related curve worth noting is the Precision Recall Curve. This curve gives us direct information about the different values of precision and recall we can achieve. It is important to note, however, that precision is not necessarily monotonic with respect to the decision threshold, even though precision generally increases as the threshold increases. This graph can be a little more challenging to analyze in some cases:

Precision Recall Curve for our example. The graph can be less “clean” but is useful when prediction quality is measured by precision and recall.
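A minimal sketch of how the points of such a curve can be computed with scikit-learn’s `precision_recall_curve`. It simulates scores the same way as the ROC example, but uses a seeded generator (the seed 0 is an arbitrary choice), so the numbers will not match the figure exactly.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Same setup as the toy example: 8 positives followed by 992 negatives
y = np.array([1] * 8 + [0] * 992)
y_score = np.append(
    rng.normal(0.8, 0.5, 8).clip(0, 1),    # simulated scores for positives
    rng.normal(0.2, 0.5, 992).clip(0, 1),  # simulated scores for negatives
)

# precision and recall have one more entry than pr_thresholds:
# the final point (precision=1, recall=0) has no matching threshold
precision, recall, pr_thresholds = precision_recall_curve(y, y_score)

print(len(precision), len(recall), len(pr_thresholds))
# Recall never increases as the threshold grows; precision is free to zigzag
print("recall is non-increasing:", bool(np.all(np.diff(recall) <= 0)))
print("precision is monotonic:  ", bool(np.all(np.diff(precision) >= 0)))
```

To draw the curve, pass the arrays to matplotlib, for example `plt.plot(recall, precision)`.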


AUC

AUC (Area Under the Curve) is a popular metric that summarizes a curve with a single number. Usually, the curve in question is the ROC Curve, so “AUC” is typically shorthand for ROC AUC. The ROC AUC is also equal to the probability that our classifier will predict a higher score for a random positive example than for a random negative example.

from sklearn.metrics import roc_auc_score
print(roc_auc_score(y, y_score))

Output: 0.727 (the exact value varies from run to run, since the scores are sampled randomly)

Code snippet for calculating the ROC AUC score.

AUC is helpful in comparing different models since it summarizes the data from the whole ROC curve, but at the end of the day, you’ll still need to look at other metrics to decide on the desired threshold for your model.
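The probabilistic interpretation of ROC AUC mentioned earlier can be checked numerically. The sketch below compares every positive score with every negative score and counts ties as half a win, an assumption needed here because clipping produces tied scores at 0 and 1. The seed is an arbitrary choice.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Same setup as the toy example: 8 positives followed by 992 negatives
y = np.array([1] * 8 + [0] * 992)
y_score = np.append(
    rng.normal(0.8, 0.5, 8).clip(0, 1),
    rng.normal(0.2, 0.5, 992).clip(0, 1),
)

pos = y_score[y == 1]
neg = y_score[y == 0]

# All positive-vs-negative score differences, shape (8, 992)
diff = pos[:, None] - neg[None, :]

# Fraction of pairs where the positive outranks the negative; ties count as 0.5
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc, roc_auc_score(y, y_score))
```

The two printed numbers should agree up to floating-point error. The pairwise comparison is quadratic in the number of examples, so it is meant as an illustration rather than a replacement for `roc_auc_score`.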


Conclusion

Accuracy is only one part of your model’s performance, especially when working with imbalanced data or when one type of error (a false positive or a false negative) has a larger impact than the other. Keep in mind the additional common metrics for classification tasks, the precision-recall tradeoff, and the need to select a suitable decision threshold for your classifier.

Further Reading

Precision and recall – Wikipedia
False positives and false negatives – Wikipedia
Metrics to evaluate your Machine Learning algorithm
20 popular Machine Learning metrics part 1
Confusion matrix
Evaluating a Machine Learning model
