## Introduction

Machine learning (ML) models have become increasingly prevalent in domains from image recognition to natural language processing. Developing and deploying binary classification models demands an understanding of their performance, often evaluated using metrics such as accuracy, precision, recall, F1 score, ROC-AUC, and PR-AUC. These metrics provide insights into different aspects of model performance, such as the trade-off between precision and recall, the ability to handle imbalanced datasets, and the ability to classify samples correctly. Nevertheless, employing these metrics requires understanding their underlying calculations, limitations, and applicability to specific problems.

This article will help you understand these key metrics by exploring their underlying calculations and relevance in different use cases. We will also discuss each metric's strengths and weaknesses and how they can be used in conjunction with other metrics to provide a more comprehensive view of model performance.

## Binary Classification Metrics

*Note:*

- **True Positive (TP):** the model correctly predicts the positive class
- **True Negative (TN):** the model correctly predicts the negative class
- **False Positive (FP):** the model predicts positive, but the actual class is negative
- **False Negative (FN):** the model predicts negative, but the actual class is positive

*These terms generate the confusion matrix, which will be used to derive the evaluation metrics in the following sections.*
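For a quick sanity check, the four counts can be read directly off scikit-learn's `confusion_matrix`; the labels below are hypothetical, chosen only for illustration:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for illustration
y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)  # TP: 3 TN: 3 FP: 1 FN: 1
```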

### Accuracy

**Accuracy** is an ML metric that measures the proportion of correct predictions made by a model over the total number of predictions made. It is one of the most widely used metrics to evaluate the performance of a classification model. Its formula is as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is a simple and intuitive metric that is easy to understand and interpret. It is particularly useful when the classes are balanced, meaning there are roughly equal numbers of positive and negative samples. In such cases, accuracy can provide a good overall assessment of the model's performance.

**Example implementation:**

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
print("Accuracy Score:", accuracy)
```

In this example, we have two arrays, y_true and y_pred, representing the true and predicted labels of a binary classification problem. We then use the accuracy_score function from scikit-learn to calculate the accuracy by passing in the true and predicted labels. In this case, the accuracy score of 0.75 means the model classified 75% of all samples correctly.

However, accuracy can be misleading when the classes are imbalanced. For example, if 95% of the samples are negative and only 5% are positive, a model that always predicts negative would achieve an accuracy of 95%, yet it would be useless for the positive class. In such cases, other metrics such as precision, recall, F1 score, and area under the precision-recall curve should be used to evaluate the model's performance.
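To see this effect concretely, the following sketch scores a hypothetical always-negative model on a 95/5 split:

```python
from sklearn.metrics import accuracy_score, recall_score

# 95 negative samples, 5 positive samples
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the negative class
y_pred = [0] * 100

print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.95 — looks great
print("Recall:", recall_score(y_true, y_pred))      # 0.0 — finds no positives
```

Despite the impressive-looking accuracy, the model never identifies a single positive instance, which recall exposes immediately.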

### Precision

Precision is the proportion of true positive predictions out of all positive predictions made by the model. It measures the accuracy of positive predictions. Its formula is:

Precision = TP / (TP + FP)

**Example implementation:**

```python
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1]

precision = precision_score(y_true, y_pred)
print("Precision Score:", precision)

In this example, we have two arrays, y_true and y_pred, representing the true and predicted labels of a binary classification problem. We then use the precision_score function from scikit-learn to calculate the precision by passing in the true and predicted labels. The precision score in this case is 0.6, meaning 60% of the positive predictions are correct. However, focusing only on precision doesn’t provide a full understanding of the model’s performance since it fails to account for false negatives.

### Recall

Recall (also called sensitivity or the true positive rate) is the proportion of true positive predictions out of all actual positive samples in the dataset. It measures the model's ability to identify all positive instances and is critical when the cost of false negatives is high. Its formula is:

Recall = TP / (TP + FN)

**Example implementation:**

```python
from sklearn.metrics import recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1]

recall = recall_score(y_true, y_pred)
print("Recall Score:", recall)
```

In this example, we have two arrays, y_true and y_pred, representing the true and predicted labels of a binary classification problem. We then use the recall_score function from scikit-learn to calculate the recall by passing in the true and predicted labels. Here, the recall score is 0.6, indicating that the model correctly identified 60% of the actual positive instances. Recall and precision happen to be equal in this example because the counts of false positives and false negatives are the same; nevertheless, the two metrics measure different aspects of model performance. The weakness of recall is that it focuses only on correctly predicting the positive class and ignores false positives.

### F1 Score

The **F1 score** is a measure of a model's accuracy that takes into account both precision and recall, where the goal is to classify instances correctly as positive or negative. Precision measures how many of the predicted positive instances were actually positive, while recall measures how many of the actual positive instances were correctly predicted. A high precision score means that the model has a low rate of false positives, while a high recall score means the model has a low rate of false negatives.

Mathematically speaking, the F1 score is the harmonic mean of precision and recall. It ranges from 0 to 1, with 1 being the best possible score. The F1 score formula is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

The harmonic mean is used to give more weight to low values. This means that if either precision or recall is low, the F1 score will also be low, even if the other value is high. For example, if a model has high precision but low recall, it will have a low F1 score because it is not correctly identifying all of the positive instances.
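A small arithmetic sketch, using hypothetical precision and recall values, shows how the harmonic mean punishes the lower of the two:

```python
# A model with high precision but very low recall (hypothetical values)
precision, recall = 0.9, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)

print("Arithmetic mean:", arithmetic_mean)      # 0.5
print("F1 (harmonic mean):", round(f1, 2))      # 0.18
```

The arithmetic mean would suggest middling performance, while the harmonic mean correctly reflects that the model misses most positive instances.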

**Example implementation:**

```python
from sklearn.metrics import f1_score

# Example data
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1]

# Calculate F1 score
f1 = f1_score(y_true, y_pred)

# Print the F1 score
print("F1 Score:", f1)
```

In this example, we have two arrays, y_true and y_pred, representing the true and predicted labels of a binary classification problem. We then use the f1_score function from scikit-learn to calculate the F1 score by passing in the true and predicted labels. A high F1 score indicates balanced performance across precision and recall; in our case, a score of 0.9 means the model distinguishes both classes well, indicating decent performance in terms of both precision and recall for the given binary classification problem. The F1 score is readily applied in applications where both false positives and false negatives have consequences.

### ROC-AUC

The **ROC** (Receiver Operating Characteristic) curve and **AUC** (Area Under the Curve) are ML metrics used to evaluate the performance of binary classification models, providing an aggregated performance measure across all possible classification thresholds. The ROC curve is a two-dimensional plot of the true positive rate (TPR) against the false positive rate (FPR): the threshold for predicting a positive or negative outcome is varied, and the TPR is plotted against the FPR for each threshold. The TPR is the proportion of actual positive samples that are correctly identified as positive by the model.

In contrast, the FPR is the proportion of actual negative samples that are incorrectly identified as positive by the model. In the figure below, each colored line represents the ROC curve of a different binary classifier, and the axes represent the FPR and TPR. The diagonal line represents a random classifier, while the top-left corner represents a perfect classifier with TPR = 1 and FPR = 0. The AUC is computed by integrating the area under the ROC curve, typically using the trapezoidal rule.

The AUC summarizes the model's overall performance in a single value: it equals the probability that a randomly chosen positive sample will be ranked higher by the model than a randomly chosen negative sample. A perfect model would have an AUC of 1, while a random model would have an AUC of 0.5. Because it condenses performance into one number, the AUC is particularly useful when comparing the performance of multiple models.
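This probabilistic interpretation can be checked directly: over a set of hypothetical scores, counting the fraction of (positive, negative) pairs in which the positive sample receives the higher score reproduces the value returned by roc_auc_score:

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores for illustration
y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_scores = [0.6, 0.8, 0.3, 0.9, 0.2, 0.7, 0.5, 0.4]

pos = [s for s, t in zip(y_scores, y_true) if t == 1]
neg = [s for s, t in zip(y_scores, y_true) if t == 0]

# Fraction of (positive, negative) pairs where the positive scores higher;
# ties would count as 0.5, though none occur in this data
wins = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg))
pairwise_auc = wins / (len(pos) * len(neg))

print("Pairwise estimate:", pairwise_auc)             # 0.9375
print("roc_auc_score:", roc_auc_score(y_true, y_scores))  # 0.9375
```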

The true and false positive rates at different thresholds are particularly useful when the classes are imbalanced, meaning there are significantly more negative samples than positive ones. In such cases, the ROC curve and AUC can provide a more accurate assessment of the model's performance than metrics such as accuracy or F1 score, which may be biased toward the majority class.

**Example implementation:**

```python
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Example data
y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_scores = [0.6, 0.8, 0.3, 0.9, 0.2, 0.7, 0.5, 0.4]

# Calculate ROC curve
fpr, tpr, _ = roc_curve(y_true, y_scores)

# Calculate ROC-AUC score
roc_auc = roc_auc_score(y_true, y_scores)
print('ROC-AUC Score:', roc_auc)

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2,
         label=f'ROC curve (area = {roc_auc:0.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
```

In the example presented above, there are two arrays, **y_true** and **y_scores**, representing the true labels and predicted scores of a binary classification problem. The predicted scores are continuous values between 0 and 1, representing the model's confidence in predicting the positive class. We then use the **roc_auc_score** function from **scikit-learn** to calculate the ROC-AUC score by passing in the true labels and predicted scores. The calculated ROC-AUC score is 0.9375, meaning there is a 93.75% chance that the model ranks a randomly chosen positive instance above a randomly chosen negative one. The higher the ROC-AUC score, the better the model distinguishes between the two classes.

### PR-AUC

**PR-AUC** (Precision-Recall Area Under the Curve) is an ML metric used to evaluate the performance of binary classification models, mainly when the classes are imbalanced. Unlike the ROC curve, which plots the TPR against the FPR, the PR curve plots precision against recall at different threshold settings; the PR-AUC is obtained by integrating the area under this curve.

Precision is the proportion of true positive predictions out of all positive predictions made by the model, while recall is the proportion of true positive predictions from all actual positive samples in the dataset. The PR curve is created by varying the threshold for predicting a positive or negative outcome and plotting the precision against the recall for each threshold.

The PR-AUC is the area under the PR curve and represents the model's overall performance. A perfect model would have a PR-AUC of 1, while a random model would have a PR-AUC equal to the ratio of positive samples in the dataset. Like the ROC-AUC, the PR-AUC provides a single value that summarizes the model's overall performance and is particularly useful when comparing the performance of multiple models. In the figure above, the grey dotted line represents a baseline classifier that simply predicts that all instances belong to the positive class, while the purple line represents an ideal classifier with perfect precision and recall at all thresholds.
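The baseline can be illustrated with a short sketch: scoring a hypothetical uninformative classifier (constant scores for every sample) with average_precision_score, a common PR-AUC estimate, recovers the positive ratio of the dataset:

```python
from sklearn.metrics import average_precision_score

# 2 positives out of 8 samples → positive ratio of 0.25 (hypothetical data)
y_true = [0, 0, 1, 0, 0, 1, 0, 0]
constant_scores = [0.5] * 8  # a classifier with no discriminative power

# Average precision of an uninformative classifier equals the positive ratio
ap = average_precision_score(y_true, constant_scores)
print("Average precision:", ap)  # 0.25
```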

The PR curve and PR-AUC can provide a more accurate assessment of the model's performance than metrics such as accuracy or F1 score, which may be biased toward the majority class. In addition, they provide insight into the trade-off between precision and recall and help identify the optimal threshold for making predictions.

**Example implementation:**

```python
from sklearn.metrics import precision_recall_curve, auc
import matplotlib.pyplot as plt

# Example data
y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_scores = [0.6, 0.8, 0.3, 0.9, 0.2, 0.7, 0.5, 0.4]

# Calculate precision-recall curve
precision, recall, _ = precision_recall_curve(y_true, y_scores)

# Calculate PR-AUC score
pr_auc = auc(recall, precision)
print('PR-AUC Score:', pr_auc)

# Plot Precision-Recall curve
plt.figure()
plt.plot(recall, precision, color='blue', lw=2,
         label=f'PR curve (area = {pr_auc:0.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc="lower left")
plt.show()
```

In this example, we have two arrays, y_true and y_scores, representing the true labels and predicted scores of a binary classification problem. The predicted scores are continuous values between 0 and 1, representing the model's confidence in predicting the positive class. We then use the precision_recall_curve function from scikit-learn to calculate the precision-recall curve, a plot of precision versus recall at various thresholds. We also use the auc function from scikit-learn to calculate the area under the precision-recall curve, which is the PR-AUC score. The result of approximately 0.94 means the model strikes a strong balance between precision and recall. The higher the PR-AUC score, the better the model balances precision and recall for the positive class.

## Multiclass Classification Metrics

Classification problems extend beyond binary settings to include multiple classes, with ML algorithms readily available to aid such predictive analysis. Evaluating the resulting model performance requires adapting the binary classification metrics to handle more classes.

Compared with binary classification, multiclass classification introduces additional complexity: class distributions can vary (with multiple minority classes), results are harder to interpret, and per-class metrics must be aggregated into a single score.

### Adaptation of Binary Metrics for Multi-Class: Example with F1 Score

In this section, we adapt the F1 score to a multiclass problem by applying several averaging strategies for an overall evaluation of the model.

```python
from sklearn.metrics import f1_score

# Example data
y_true = [0, 1, 2, 2, 0, 1, 2, 0]
y_pred = [0, 2, 1, 2, 0, 0, 1, 0]

# Calculate F1 scores using different averaging methods
f1_macro = f1_score(y_true, y_pred, average='macro')
f1_micro = f1_score(y_true, y_pred, average='micro')
f1_weighted = f1_score(y_true, y_pred, average='weighted')

print("Macro Averaged F1 Score:", f1_macro)
print("Micro Averaged F1 Score:", f1_micro)
print("Weighted Averaged F1 Score:", f1_weighted)
```

- **Macro Averaging:** computes the F1 score independently for each class and then takes the unweighted average, treating all classes equally.

- **Micro Averaging:** counts the total TP, FP, and FN across all classes and computes a single F1 score from these global counts.

- **Weighted Averaging:** builds upon macro averaging but weights each class's F1 score by the number of instances (support) in that class.
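The macro and weighted strategies can be reproduced by hand from the per-class scores (obtained with average=None), using the class counts as weights; this sketch checks both against scikit-learn's own aggregation:

```python
import numpy as np
from sklearn.metrics import f1_score

# Same example data as above
y_true = [0, 1, 2, 2, 0, 1, 2, 0]
y_pred = [0, 2, 1, 2, 0, 0, 1, 0]

# Per-class F1 scores, one entry per class, no aggregation
per_class = f1_score(y_true, y_pred, average=None)

# Macro: simple mean; weighted: mean weighted by class support
support = np.bincount(y_true)
macro = per_class.mean()
weighted = np.average(per_class, weights=support)

print("Per-class F1:", per_class)
print("Macro matches sklearn:",
      np.isclose(macro, f1_score(y_true, y_pred, average='macro')))
print("Weighted matches sklearn:",
      np.isclose(weighted, f1_score(y_true, y_pred, average='weighted')))
```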

## Summary of Strengths and Weaknesses of Metrics

| Metric | Strengths | Weaknesses |
| --- | --- | --- |
| Accuracy | Easy to understand and compute; provides a general performance measure. | Can be misleading on imbalanced datasets; does not differentiate between types of errors. |
| Precision | Useful when the cost of false positives is high; measures the accuracy of positive predictions. | Does not account for false negatives; less informative if not considered alongside recall. |
| Recall | Crucial when the cost of false negatives is high; measures the ability to identify positive instances. | Does not account for false positives; less informative if not considered alongside precision. |
| F1 Score | Balances precision and recall; useful on imbalanced datasets. | Does not account for true negatives. |
| ROC-AUC | Provides a comprehensive performance measure across all thresholds; useful on imbalanced datasets. | Does not take into account the cost or benefit of different types of errors. |
| PR-AUC | Focuses on performance with respect to the positive class; useful on imbalanced datasets. | Can be less intuitive to interpret. |

**Table:** Strengths and Weaknesses of Metrics

## Conclusion

The article discusses the importance of various metrics for evaluating the performance of binary and multiclass classification models. The F1 score is a useful metric that balances precision and recall but should not be used in isolation as it does not account for true negatives. Accuracy is a simple metric but should be used cautiously, especially for imbalanced classes. The ROC curve and AUC are comprehensive metrics that evaluate the performance of a model across different thresholds and are particularly useful for imbalanced classes. Finally, PR-AUC is a metric that measures the overall performance of a binary classification model by plotting precision against recall at different threshold settings, providing a more accurate assessment of performance for imbalanced classes.

In conclusion, knowing and picking the correct classification performance metrics is important, as it can set your project up for success or failure. This analysis is not straightforward; it involves understanding your dataset, the trade-offs you are willing to accept, and the implications of the model's predictions. Thanks for reading!