Introduction
Machine learning (ML) models have become increasingly prevalent across domains, from image recognition to natural language processing. Developing and deploying binary classification models demands an understanding of their performance, often evaluated with metrics such as the F1 score, accuracy, ROC-AUC, and PR-AUC. These metrics provide insights into different aspects of model performance, such as the trade-off between precision and recall, robustness to imbalanced datasets, and the ability to classify samples correctly. Using them well, however, requires understanding their underlying calculations, limitations, and applicability to specific problems.
This article will provide an understanding of these key metrics by exploring their underlying calculations and relevance in different use cases. We will also discuss the strengths and weaknesses of each metric and how they can be used in conjunction with other metrics to provide a more comprehensive view of model performance.
F1 Score
The F1 score is a measure of a model’s accuracy that takes into account both precision and recall, where the goal is to classify instances correctly as positive or negative. Precision measures how many of the predicted positive instances were actually positive, while recall measures how many of the actual positive instances were correctly predicted. A high precision score means that the model has a low rate of false positives, while a high recall score means that the model has a low rate of false negatives.
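As a minimal sketch, precision and recall can be computed directly from the counts of true positives, false positives, and false negatives. The counts below are illustrative (they happen to match the worked F1 example later in this section):

# Illustrative confusion-matrix counts for the positive class
tp = 3  # predicted positive and actually positive
fp = 1  # predicted positive but actually negative
fn = 1  # predicted negative but actually positive

precision = tp / (tp + fp)  # 3 / 4 = 0.75
recall = tp / (tp + fn)     # 3 / 4 = 0.75

print("Precision:", precision)
print("Recall:", recall)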
Mathematically speaking, the F1 score is the harmonic mean of precision and recall. It ranges from 0 to 1, with 1 being the best possible score. The formula for the F1 score is:
F1 = 2 * (precision * recall) / (precision + recall)
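The short sketch below, using hypothetical precision and recall values, shows how the formula behaves when the two values are far apart:

# Hypothetical values: high precision, low recall
precision = 0.9
recall = 0.1

arithmetic_mean = (precision + recall) / 2
f1 = 2 * (precision * recall) / (precision + recall)

print("Arithmetic mean:", arithmetic_mean)  # 0.5
print("F1 (harmonic mean):", f1)            # ~0.18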
The harmonic mean is used to give more weight to low values. This means that if either precision or recall is low, the F1 score will also be low, even if the other value is high. For example, if a model has high precision but low recall, it will have a low F1 score because it is not correctly identifying all of the positive instances. An example F1 score calculation in Python is given below.
from sklearn.metrics import f1_score

# Example data
y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

# Calculate F1 score
f1score = f1_score(y_true, y_pred, average="weighted")

# Print the F1 score
print("F1 Score:", f1score)
# Output: F1 Score: 0.75
F1 score example
In this example, we have two arrays, y_true and y_pred, representing the true and predicted labels of a binary classification problem. We then use the f1_score function from scikit-learn to calculate the F1 score by passing in the true and predicted labels and setting the average parameter to ‘weighted’ for weighted averaging of the F1 score across labels. The calculated F1 score is 0.75 or 75%, which indicates a decent performance in terms of precision and recall for the given binary classification problem.
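As a sanity check, the same value can be reconstructed from the individual precision and recall scores. In this particular data both classes happen to have identical precision and recall, so the weighted F1 coincides with the F1 of the positive class:

from sklearn.metrics import precision_score, recall_score

y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

# Precision and recall for the positive class
precision = precision_score(y_true, y_pred)  # 0.75
recall = recall_score(y_true, y_pred)        # 0.75

# Apply the F1 formula manually
f1 = 2 * (precision * recall) / (precision + recall)
print("F1 from precision and recall:", f1)   # 0.75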
Accuracy
Accuracy is an ML metric that measures the proportion of correct predictions made by a model over the total number of predictions made. It is one of the most widely used metrics to evaluate the performance of a classification model.
Accuracy can be calculated using the following formula:
Accuracy = (number of correct predictions) / (total number of predictions)
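As a quick worked example with illustrative numbers, a model that gets 6 out of 8 predictions right has an accuracy of 0.75:

# Illustrative counts
correct_predictions = 6
total_predictions = 8

accuracy = correct_predictions / total_predictions
print("Accuracy:", accuracy)  # 0.75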
Accuracy is a simple and intuitive metric that is easy to understand and interpret. It is particularly useful when the classes are balanced, meaning that there are roughly equal numbers of positive and negative samples. In such cases, accuracy can provide a good overall assessment of the model’s performance.
However, accuracy can be misleading when the classes are imbalanced. For example, if 95% of the samples are negative and only 5% are positive, a model that always predicts negative would achieve an accuracy of 95%, yet be useless for identifying the positive class. In such cases, other metrics such as precision, recall, F1 score, and the area under the precision-recall curve should be used to evaluate the model's performance. Here is an example of an accuracy score calculation in Python.
from sklearn.metrics import accuracy_score

# Example data
y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

# Calculate accuracy score
accuracy = accuracy_score(y_true, y_pred)

# Print the accuracy score
print("Accuracy Score:", accuracy)
# Output: Accuracy Score: 0.75
Accuracy score calculation
In this example, we have two arrays, y_true and y_pred, representing the true and predicted labels of a binary classification problem. We then use the accuracy_score function from scikit-learn to calculate the accuracy by passing in the true and predicted labels. The calculated accuracy score is 0.75, meaning the model has correctly classified 75% of the instances in this example data.
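To make the imbalance caveat above concrete, here is a small sketch on synthetic data: a model that always predicts the negative class on a dataset with 95% negative samples reaches 95% accuracy while recalling none of the positives:

from sklearn.metrics import accuracy_score, recall_score

# Synthetic imbalanced data: 95 negative samples, 5 positive samples
y_true = [0] * 95 + [1] * 5

# A model that always predicts the negative class
y_pred = [0] * 100

print("Accuracy:", accuracy_score(y_true, y_pred))               # 0.95
print("Recall (positive class):", recall_score(y_true, y_pred))  # 0.0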
ROC-AUC
The ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve) are ML metrics used to evaluate the performance of binary classification models. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR); it is created by varying the threshold used to predict a positive or negative outcome and computing the TPR and FPR at each threshold. The TPR is the proportion of actual positive samples that are correctly identified as positive by the model, while the FPR is the proportion of actual negative samples that are incorrectly identified as positive. In the figure below, each coloured line represents the ROC curve of a different binary classifier system, with the axes showing the FPR and TPR. The diagonal line represents a random classifier, while the top-left corner represents a perfect classifier with TPR = 1 and FPR = 0.

ROC curve (Source)
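In code, the points of such a curve can be obtained with scikit-learn's roc_curve function, which returns the FPR, TPR, and the corresponding thresholds. Below is a minimal sketch using the same example scores as in the ROC-AUC calculation later in this section:

from sklearn.metrics import roc_curve

y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_scores = [0.6, 0.8, 0.3, 0.9, 0.2, 0.7, 0.5, 0.4]

# FPR and TPR at each decision threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")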
At the same time, the AUC represents the overall performance of the model. The AUC is the area under the ROC curve, representing the probability that a randomly chosen positive sample will be ranked higher by the model than a randomly chosen negative sample. A perfect model would have an AUC of 1, while a random model would have an AUC of 0.5. The AUC provides a single value that summarizes the model’s overall performance and is particularly useful when comparing the performance of multiple models.
The true and false positive rates at different thresholds are particularly useful when the classes are imbalanced, meaning there are significantly more negative samples than positive ones. In such cases, the ROC curve and AUC can provide a more accurate assessment of the model’s performance than metrics such as accuracy or F1 score, which may be biased towards the majority class.
from sklearn.metrics import roc_auc_score

# Example data
y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_scores = [0.6, 0.8, 0.3, 0.9, 0.2, 0.7, 0.5, 0.4]

# Calculate ROC-AUC score
roc_auc = roc_auc_score(y_true, y_scores)

# Print the ROC-AUC score
print("ROC-AUC Score:", roc_auc)
# Output: ROC-AUC Score: 0.9375
ROC-AUC score calculation
In the example above, we have two arrays, y_true and y_scores, representing the true labels and predicted scores of a binary classification problem. The predicted scores are continuous values between 0 and 1, representing the model's confidence in predicting the positive class. We then use the roc_auc_score function from scikit-learn to calculate the ROC-AUC score by passing in the true labels and predicted scores. The calculated ROC-AUC score is 0.9375, or 93.75%, meaning there is a 93.75% probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative one. The higher the ROC-AUC score, the better the model's performance in terms of its ability to distinguish between the two classes.
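This probabilistic interpretation can be verified by brute force on the same data: over all positive/negative pairs, count how often the positive instance receives the higher score (ties would count as half, but this data has none):

y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_scores = [0.6, 0.8, 0.3, 0.9, 0.2, 0.7, 0.5, 0.4]

pos = [s for s, y in zip(y_scores, y_true) if y == 1]
neg = [s for s, y in zip(y_scores, y_true) if y == 0]

# Fraction of positive/negative pairs ranked correctly
wins = sum(1 for p in pos for n in neg if p > n)
total = len(pos) * len(neg)
print("Pairwise ranking probability:", wins / total)  # 15/16 = 0.9375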
PR-AUC
PR-AUC (Precision-Recall Area Under the Curve) is an ML metric used to evaluate the performance of binary classification models, mainly when the classes are imbalanced. Unlike the ROC curve and AUC, which plot the TPR against the FPR, the PR curve plots the precision against the recall at different threshold settings.
Precision is the proportion of true positive predictions out of all positive predictions made by the model, while recall is the proportion of true positive predictions from all actual positive samples in the dataset. The PR curve is created by varying the threshold for predicting a positive or negative outcome and plotting the precision against the recall for each threshold.

PR curve (Source)
The PR-AUC is the area under the PR curve, and represents the overall performance of the model. A perfect model would have a PR-AUC of 1, while a random model would have a PR-AUC equal to the ratio of positive samples in the dataset. Like the AUC, the PR-AUC provides a single value that summarizes the model’s overall performance and is particularly useful when comparing the performance of multiple models. In the figure above, the grey dotted line represents a “baseline” classifier — this classifier would simply predict that all instances belong to the positive class. The purple line represents an ideal classifier with perfect precision and recall at all thresholds.
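The claim that a random model's PR-AUC roughly equals the ratio of positive samples can be checked empirically. The sketch below uses synthetic data with 10% positives and purely random scores, with the same precision_recall_curve and auc approach as the example later in this section; the result should land close to 0.10:

import numpy as np
from sklearn.metrics import precision_recall_curve, auc

rng = np.random.default_rng(0)

# Synthetic labels with roughly 10% positives and purely random scores
y_true = (rng.random(100_000) < 0.10).astype(int)
y_scores = rng.random(100_000)

precision, recall, _ = precision_recall_curve(y_true, y_scores)
pr_auc = auc(recall, precision)

print("PR-AUC of a random model:", pr_auc)  # close to the positive ratio, ~0.10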
The PR curve and PR-AUC provide a more accurate assessment of the model’s performance than metrics such as accuracy or F1 score, which may be biased towards the majority class. In addition, they can provide insight into the trade-off between precision and recall and help to identify the optimal threshold for making predictions.
from sklearn.metrics import precision_recall_curve, auc

# Example data
y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_scores = [0.6, 0.8, 0.3, 0.9, 0.2, 0.2, 0.8, 0.4]

# Calculate precision-recall curve and PR-AUC score
precision, recall, _ = precision_recall_curve(y_true, y_scores)
pr_auc = auc(recall, precision)

# Print the PR-AUC score
print("PR-AUC Score:", pr_auc)
# Output: PR-AUC Score: 0.875
PR-AUC score calculation
In this example, we have two arrays, y_true and y_scores, representing the true labels and predicted scores of a binary classification problem. The predicted scores are continuous values between 0 and 1, representing the model's confidence in predicting the positive class. We then use the precision_recall_curve function from scikit-learn to compute the precision and recall at various thresholds, and the auc function to calculate the area under the resulting precision-recall curve, which is the PR-AUC score. The result of 0.875, or 87.5%, means that the model has a good balance between precision and recall. The higher the PR-AUC score, the better the model's performance in terms of its ability to balance precision and recall for the positive class.
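As noted above, the PR curve can also help identify an operating threshold. Here is a minimal sketch that picks the threshold maximizing the F1 score from the precision_recall_curve output, using the same example data; the per-threshold F1 computation is a plain application of the formula from earlier, not a scikit-learn helper:

import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_scores = [0.6, 0.8, 0.3, 0.9, 0.2, 0.2, 0.8, 0.4]

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# precision and recall have one more entry than thresholds (the final point
# at recall = 0, precision = 1), so drop it before computing F1 per threshold
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)

best = np.argmax(f1)
print("Best threshold:", thresholds[best])  # 0.8 for this data
print("F1 at that threshold:", f1[best])    # ~0.86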
Conclusion
The article discusses the importance of various metrics for evaluating the performance of binary classification models. The F1 score is a useful metric that balances precision and recall but should not be used in isolation as it does not account for true negatives. Accuracy is a simple metric but should be used cautiously, especially for imbalanced classes. The ROC curve and AUC are comprehensive metrics that evaluate the performance of a model across different thresholds and are particularly useful for imbalanced classes. Finally, PR-AUC is a metric that measures the overall performance of a binary classification model by plotting precision against recall at different threshold settings, providing a more accurate assessment of performance for imbalanced classes.