# A Guide to Evaluation Metrics for Classification Models

## Introduction

Suppose you are trying to detect a rare disease based on a patient’s speech with a sophisticated ML model. You train your model and achieve 99% accuracy. You have saved humanity!

Hold on a minute. If more than 99% of patients do not have this disease, your model can simply predict the label “False” for any input and achieve a very high level of accuracy, but surely this is not what we intended…

It is worth noting that there are many different metrics that are relevant for other tasks such as regression, vision tasks, and NLP tasks which you should check out as well.

## Accuracy

Accuracy gives us an overall picture of how much we can rely on our model’s predictions. This metric is blind to the difference between classes and types of errors, so for imbalanced datasets, accuracy alone is generally not enough.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Baseline "model" that always predicts 0
y_pred = np.array([0] * 1000)
# Imbalanced labels: only 8 positives out of 1000 examples
y = np.array([1] * 8 + [0] * 992)
print(accuracy_score(y, y_pred))
# Output: 0.992
```

In this code snippet we defined an imbalanced dataset where over 99% of the examples have the label “0”, our baseline model will simply output “0” irrespective of the input. As we can see, this model achieves an accuracy score of 99.2%.

## Confusion Matrix

One way to ensure we are not blinded by the overall accuracy is to evaluate our model’s quality on each class independently. A popular way to visualize this is the confusion matrix. Plotting the confusion matrix for multiclass classification will also help us identify the most common mistakes, hence the name confusion matrix.

*Image: A confusion matrix for the iris dataset, where the most common error is predicting virginica instead of versicolor (Source).*

In the confusion matrix, the elements on the main diagonal represent the correct predictions, while the elements elsewhere represent errors.

Running the following code on our example will plot the appropriate confusion matrix:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_cm(data):
    fig, ax = plt.subplots()
    ax.matshow(data, cmap='Reds')
    for (i, j), z in np.ndenumerate(data):
        ax.text(j, i, '{:0.1f}'.format(z), ha='center', va='center')
    plt.xlabel("y_pred")
    plt.ylabel("y_true")
    plt.show()

cm = confusion_matrix(y, y_pred)
plot_cm(cm)
cm = confusion_matrix(y, y_pred, normalize="true")
plot_cm(cm)
```

*Image: The confusion matrix for our example (default and normalized); the most common error is a false negative.*

## False Positives and False Negatives

> A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition such as a disease when the disease is not present, while a false negative is the opposite error where the test result incorrectly fails to indicate the presence of a condition when it is present.

*“False positives and false negatives”, Wikipedia*

Following the example from Wikipedia, a true positive would be a correct prediction of the presence of a disease, while a true negative is a correct prediction of its absence.

It is very common to define the tolerance for each type of error independently, since each error type has a different effect on reality. For example, failing to diagnose a patient with cancer can lead to late detection of the illness, and thus have fatal consequences. On the other hand, falsely diagnosing a patient with cancer might cause the patient much stress and anxiety, but there usually won’t be a long-term effect.

### Defining Positive and Negative

While binary classification can be a relatively symmetric task, the labels are often defined in accordance with the typical setting of identifying a medical condition in a patient, where “positive” means the condition is present. Another way to think about these errors is whether we are sounding a false alarm (false positive) or failing to sound an alarm (false negative).

### Relation to the Confusion Matrix

For binary classification, the cells of the confusion matrix correspond directly to the four categories we’ve discussed: TP (true positive), TN (true negative), FP (false positive), and FN (false negative).

*Image: Visualization of classification categories through the confusion matrix (Source).*

## Precision and Recall

Using our new terminology we can now redefine accuracy as the fraction of correct predictions: Accuracy = (TP + TN) / (TP + TN + FP + FN). Accuracy is a class-agnostic metric, and thus it does not grant us much information regarding the distribution of false positives and false negatives.
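As a small sketch of this definition, the four counts and the resulting accuracy can be computed directly with NumPy (the labels and predictions below are hypothetical, made up for illustration):

```python
import numpy as np

# Hypothetical labels and predictions for a tiny dataset
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy)  # 2 4 1 1 0.75
```

Here 6 of the 8 predictions are correct, so the accuracy is 0.75 regardless of how the two error types are distributed.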

Precision and recall are two popular metrics that do contribute to our understanding of the types of errors we make.

Precision, or positive predictive value, gives us a measure of how much we can trust a positive prediction of our model: Precision = TP / (TP + FP). Recall, also called sensitivity or true positive rate (TPR), gives us a measure of how many of the real “true” values we detected: Recall = TP / (TP + FN).

When we want to keep false positives to a minimum, we want to increase the precision of our model, and when we want to reduce false negatives, we want to increase the recall.

*Image: Visualization of precision and recall (Source).*

While in our original example we had 99.2% accuracy, we have undefined precision (division by 0) and 0% recall. Thus these metrics would be useful in detecting that our model is missing something important.

Returning to the confusion matrix, note that precision for a specific class can be perceived as taking the appropriate value on the diagonal and normalizing it by the sum of the values of the column. Similarly, recall is the result of normalizing a diagonal value by its row. Finally, accuracy is the result of normalizing the sum of the diagonal values by the sum of all matrix elements.
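These relationships can be checked numerically. A sketch with a small hypothetical 2×2 confusion matrix (rows are true labels, columns are predictions, following the convention above):

```python
import numpy as np

# Hypothetical binary confusion matrix: rows = y_true, columns = y_pred
cm = np.array([[50, 10],   # TN, FP
               [5,  35]])  # FN, TP

# Per-class precision: diagonal normalized by column sums
precision_per_class = np.diag(cm) / cm.sum(axis=0)
# Per-class recall: diagonal normalized by row sums
recall_per_class = np.diag(cm) / cm.sum(axis=1)
# Accuracy: diagonal sum normalized by the total
accuracy = np.trace(cm) / cm.sum()

print(precision_per_class, recall_per_class, accuracy)
```

For the positive class, this reproduces Precision = 35/45 and Recall = 35/40, and the overall accuracy is 85/100.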

### F1 Score

Usually, there is a tradeoff between getting high precision and high recall, so a common metric that gives a balanced overall score is the F1 score, defined as the harmonic mean of precision and recall: F1 = 2 · (Precision · Recall) / (Precision + Recall).

## ROC Curve

In order to visualize the tradeoff between false positives and false negatives for a given model, we can plot the receiver operating characteristic (ROC) curve, which essentially plots different possible values of TPR and FPR that are obtained by using different decision thresholds (the threshold for deciding whether a prediction is labeled “true” or “false”) for our predictive model.

For example, if we have a model that is meant to predict whether a financial transaction is fraudulent, which outputs the vector (0.1, 0.7, 0.95, 0.5), we will obtain the following binary predictions:

Threshold=0.2: (0, 1, 1, 1)
Threshold=0.8: (0, 0, 1, 0)
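The thresholding step above can be sketched in a couple of lines of NumPy (the helper name `binarize` is our own, not a library function):

```python
import numpy as np

# The model's probabilistic outputs from the fraud example
scores = np.array([0.1, 0.7, 0.95, 0.5])

def binarize(scores, threshold):
    """Turn probabilistic scores into hard 0/1 predictions."""
    return (scores >= threshold).astype(int)

print(binarize(scores, 0.2))  # [0 1 1 1]
print(binarize(scores, 0.8))  # [0 0 1 0]
```

Sweeping the threshold from 0 to 1 and recording the resulting TPR and FPR at each value is exactly what traces out the ROC curve.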

The decision threshold is used as a knob to control the tradeoff between our desire for a high TPR and a low FPR. Increasing the threshold will generally result in an increase in precision, but a decrease in recall.

*Image: A rounder ROC curve represents a more precise model; the point (0, 1) represents a “perfect classifier” (Source).*

Following is a code implementation for plotting an ROC curve that matches our toy example; we can also plot the matching thresholds for each point on the curve (image on the right). Note that we need our predictions to be probabilistic in order to plot such a curve, since the different points on the graph are generated by changing the decision threshold. Below, we simulate the probabilistic predictions of an “okay” classifier.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Simulate probabilistic predictions of an "okay" classifier:
# positives tend to score higher than negatives, with overlap
preds_for_label_true = np.random.normal(0.8, 0.5, 8).clip(0, 1)
preds_for_label_false = np.random.normal(0.2, 0.5, 992).clip(0, 1)
y_score = np.append(preds_for_label_true, preds_for_label_false)

fpr, tpr, thresholds = roc_curve(y, y_score)
plt.plot(fpr, tpr)
plt.title("ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.show()
```

*Image: The ROC curve can be used to select a sweet spot, where the TPR is high but the FPR is low. As the threshold decreases, both TPR and FPR increase (Source).*

### Precision Recall Curve

Another related curve worth noting is the precision-recall curve. This curve gives us direct information about the different combinations of precision and recall we can achieve. However, it is important to note that precision is not necessarily monotonic with respect to the prediction threshold, even though precision generally increases as the threshold increases. Thus this graph can be a little more challenging to analyze in some cases.

*Image: Precision-recall curve for our example. The graph can be less “clean” but is useful when prediction quality is measured by precision and recall.*
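Plotting it mirrors the ROC code above. A sketch using scikit-learn’s `precision_recall_curve` on simulated scores (the seed and score distributions are our own choices, matching the earlier simulation):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# Simulated scores: positives tend to score higher than negatives
y = np.array([1] * 8 + [0] * 992)
y_score = np.append(rng.normal(0.8, 0.5, 8),
                    rng.normal(0.2, 0.5, 992)).clip(0, 1)

precision, recall, thresholds = precision_recall_curve(y, y_score)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()
```

Note that scikit-learn appends a final (precision, recall) point of (1, 0) with no corresponding threshold, which is why the arrays are one element longer than `thresholds`.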

### AUC

AUC, or area under the curve, is a popular metric that is used to summarize a graph by using a single number. Usually, the curve referred to is the ROC curve, and thus the term is short for ROC AUC. AUC is also equal to the probability that our classifier will predict a higher score for a random positive example, than for a random negative example.
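That probabilistic interpretation can be checked directly by comparing every positive score against every negative score and counting how often the positive ranks higher (ties count as half). A sketch on simulated scores, with the seed and distributions being our own choices:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y = np.array([1] * 8 + [0] * 992)
y_score = np.append(rng.normal(0.8, 0.5, 8),
                    rng.normal(0.2, 0.5, 992)).clip(0, 1)

pos = y_score[y == 1]
neg = y_score[y == 0]
# All (positive, negative) score differences, shape (8, 992)
pairs = pos[:, None] - neg[None, :]
# Fraction of pairs where the positive outranks the negative
prob = np.mean(pairs > 0) + 0.5 * np.mean(pairs == 0)

print(prob, roc_auc_score(y, y_score))  # the two values coincide
```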

```python
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y, y_score))
# Output: 0.727
```

Code snippet for calculating the ROC AUC score.

AUC is a metric that is helpful in comparing different models since it summarizes the data from the whole ROC curve. However, at the end of the day you’ll still need to look at other metrics as well in order to decide on the desired threshold for your model.
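One common heuristic for picking that threshold from the ROC curve is Youden’s J statistic, which simply maximizes TPR − FPR over the candidate thresholds. A sketch on simulated scores (seed and distributions are our own choices):

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(7)
y = np.array([1] * 8 + [0] * 992)
y_score = np.append(rng.normal(0.8, 0.5, 8),
                    rng.normal(0.2, 0.5, 992)).clip(0, 1)

fpr, tpr, thresholds = roc_curve(y, y_score)
# Index of the point on the ROC curve maximizing TPR - FPR (Youden's J)
best = np.argmax(tpr - fpr)
print("threshold:", thresholds[best], "TPR:", tpr[best], "FPR:", fpr[best])
```

Whether this is the right choice depends on the relative cost of false positives and false negatives in your application; a fraud or cancer detector may deliberately accept a higher FPR to push the TPR up.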

## Conclusion

In conclusion, we have seen in this post that accuracy is only part of the story of your model’s performance, especially when working with imbalanced data or in situations where a false positive has a larger impact than a false negative, or vice versa. We discussed some additional common metrics for classification tasks and learned about the precision-recall tradeoff and how to select the optimal decision threshold for your classifier.