How to Apply and Calculate the F1 Score in Machine Learning


Introduction

The F1 score is a machine learning (ML) metric for evaluating model accuracy, combining precision and recall. While accuracy has long been a primary metric, it’s most effective for balanced datasets. The F1 score offers a more detailed evaluation, focusing on class-specific performance. Due to its integration of precision and recall, the F1 score is used for determining the effectiveness of models in complex fields like computer vision, where deep-learning algorithms are commonly used.

In this blog, we’ll explore what the F1 score is, how to calculate it, and its importance in ML.

Understanding the F1 score

The F1 score, also known as the F1 measure, is a statistical measure used to assess the accuracy of a model’s predictions. It considers both precision (the number of correct positive predictions divided by the total number of positive predictions) and recall (the number of correct positive predictions divided by the number of actual positive instances in the data). The F1 score is particularly useful in situations where there is an uneven class distribution, such as in cases of imbalanced datasets.

To calculate the F1 score, you use the formula:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

This ensures that both precision and recall are taken into account, providing a more balanced view of a model’s performance than considering accuracy alone.

Let’s break down the F1 score calculation into steps:

1. Calculate precision

Precision measures the accuracy of the positive predictions made by the model. In other words, it answers the question: “Out of all the instances the model labeled as positive, how many are actually positive?” Precision is calculated as the ratio of true positives (correct positive predictions) to the sum of true and false positives (all positive predictions):

Precision = True Positives / (True Positives + False Positives)

2. Calculate recall

Also known as sensitivity, recall assesses the model’s ability to identify all relevant instances. It answers: “Out of all the actual positive instances, how many did the model correctly identify?” Recall is the ratio of true positives to the sum of true positives and false negatives (all actual positive instances):

Recall = True Positives / (True Positives + False Negatives)

3. Apply the F1 formula

Use the F1 score formula to calculate the F1 score by substituting your precision and recall values.

The F1 score range is between 0 and 1, where 1 indicates perfect precision and recall, and 0 indicates the poorest. A higher F1 score suggests a better balance of precision and recall, indicating a more effective model.
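The three steps above can be sketched in a few lines of Python. The counts below are illustrative values chosen for this sketch, not figures from a real model:

```python
# Minimal sketch: compute precision, recall, and F1 from raw
# true-positive, false-positive, and false-negative counts.
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)  # step 1: quality of positive predictions
    recall = tp / (tp + fn)     # step 2: coverage of actual positives
    # step 3: harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1_from_counts(90, 10, 30), 2))  # 0.82
```

Because the F1 score is a harmonic mean, it stays low unless both precision and recall are reasonably high, which is exactly the balancing behavior described above.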

F1 Score in ML

The F1 score is particularly important for evaluating classification models, especially when classes are imbalanced. Its utility comes from balancing precision and recall in a single, comprehensive measure of a model’s performance. In many real-world scenarios, merely maximizing accuracy is not sufficient: precision (the quality of the positive predictions) and recall (the model’s ability to detect all positive instances) often trade off against each other. The F1 score harmonizes these two aspects, ensuring that a model does not overly favor one at the expense of the other.

Moreover, in imbalanced datasets, where one class significantly outnumbers the other, accuracy alone can be misleading. A model that always predicts the majority class will have high accuracy but poor predictive quality. By considering both precision and recall, the F1 score provides a more accurate representation of a model’s effectiveness in these scenarios.

Its application to imbalanced datasets deserves particular emphasis. Common in fields like fraud detection, medical diagnosis, and spam filtering, class imbalance poses a challenge because it skews the model toward the majority class. The F1 score is particularly useful here, as it penalizes both false positives and false negatives, which are common issues in such datasets. Additionally, during model tuning, the F1 score proves to be a more effective objective than accuracy on these datasets, helping to find the optimal balance between recall (sensitivity) and precision when the class distribution is uneven.

When comparing multiple models, especially in scenarios where precision and recall are both important, the F1 score offers a single metric to assess performance. This simplification is important for decision-making, especially when dealing with non-expert stakeholders.


Example

Let’s consider an example in the context of the ML model used for email spam detection. The model’s goal is to classify emails as either ‘spam’ or ‘not spam.’ The performance of this model is summarized in the following confusion matrix:

Total number of emails: 200

                        Actual spam            Actual not spam
Predicted spam          True Positives: 80     False Positives: 20
Predicted not spam      False Negatives: 30    True Negatives: 70

From this confusion matrix, we can calculate the precision and recall of the model:

  • Precision: This metric tells us, out of all the emails we predicted as spam, how many were actually spam: Precision = 80 / (80 + 20) = 0.80.

  • Recall: This metric tells us, out of all the actual spam emails, how many we correctly predicted as spam: Recall = 80 / (80 + 30) ≈ 0.73.

Now, let’s calculate the F1 score using these values: F1 = 2 × (0.80 × 0.73) / (0.80 + 0.73) ≈ 0.76.

Therefore, our spam detection model has an F1 score of approximately 0.76. This score, being closer to 1, suggests that the model has a fairly good balance between precision and recall. However, it also indicates room for improvement, especially in reducing the number of false negatives (spam emails that go undetected) and false positives (legitimate emails classified as spam). The closer the precision and recall values are to each other, and to 1, the better the F1 score, reflecting a more effective and balanced model.
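The worked example can be cross-checked with scikit-learn (one common choice; any library that computes F1 would do). The label lists below are reconstructed from the confusion-matrix counts above: 80 true positives, 20 false positives, 30 false negatives, and 70 true negatives:

```python
# Reconstruct labels matching the confusion matrix (1 = spam, 0 = not spam).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1] * 80 + [0] * 20 + [1] * 30 + [0] * 70
y_pred = [1] * 80 + [1] * 20 + [0] * 30 + [0] * 70

print(round(precision_score(y_true, y_pred), 2))  # 0.8
print(round(recall_score(y_true, y_pred), 2))     # 0.73
print(round(f1_score(y_true, y_pred), 2))         # 0.76
```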

Applications of the F1 Score in real-world scenarios

The F1 score is particularly useful in applications where the balance between precision and recall matters. Some areas where the F1 score finds its most impactful applications include:

1. Medical diagnosis

In medical diagnostics, the consequences of false negatives (missing a diagnosis) and false positives (incorrectly diagnosing a healthy patient) are profound. The F1 score helps in evaluating models where it’s important to minimize both types of errors.

The F1 score is used, for instance, to evaluate ML models that predict various diseases based on laboratory tests. A notable example is the development of an ensemble model, which combined deep learning with two ML models. This model was optimized using validation loss and achieved an F1 score of 81%, with a prediction accuracy of 92%, precision of 78%, and recall of 88%. The ensemble model showed superior performance compared to individual DL and ML models in all aspects, including the F1 score. It was particularly effective in predicting diseases such as acute hepatitis B, malaria, aplastic anemia, meningitis, and acute myocardial infarction. The ensemble model’s predictive power was further improved when optimized for accuracy, demonstrating the utility of the F1 score in balancing precision and recall for accurate disease classification​​.

Another application of the F1 score in medical diagnosis is seen in the field of multi-modality medical image classification. In this context, different platforms capable of processing large datasets, like those from Amazon, Apple, Google, and Microsoft, were compared based on their F1 scores. The analysis revealed significant differences in the F1 scores across these platforms, indicating the varying effectiveness of their ML models in classifying medical images. For example, models trained on the Waterloo OCT dataset exhibited high classification performance, with F1 scores ranging from 78.8% to 97.8% across different platforms. The study also highlighted the importance of features like custom test/train splits, batch prediction, cross-validation, and data augmentation for the effective application of these models in diagnostic classification problems​​.

2. Fraud detection

Fraudulent transactions are typically much less frequent than legitimate ones. This imbalance makes accuracy a less reliable metric, as a model that predicts ‘no fraud’ for all transactions might still appear accurate. Here, failing to catch fraud (false negatives) and wrongly flagging legitimate transactions (false positives) both have significant consequences.

For example, in the detection of identity theft, ML models are trained to identify unusual activities such as atypical login patterns, geographic locations, and device usage. These models are designed to recognize behaviors that deviate from an individual’s historical patterns. By correlating multiple data sources, these systems can detect instances of identity theft or fraudulent financial transactions.

Another aspect of fraud detection where the F1 score is vital is the development of anomaly detection techniques. These techniques involve identifying unusual patterns within datasets, which are specifically important for detecting potential fraud. Advanced ML-based algorithms, such as Isolation Forest and One-Class SVM, are used alongside traditional statistical methods to uncover irregularities and potential fraud. This significantly improves security and risk management strategies.

This metric is also used in building and training fraud detection models, both in supervised and unsupervised approaches. For example, predictive models like Random Forest and Neural Networks are developed using labeled data, where the models learn patterns of fraudulent and legitimate transactions. The model’s effectiveness is evaluated using metrics like precision, recall, and the F1 score, ensuring the model accurately identifies fraud while minimizing false positives and negatives​.

3. Sentiment analysis

In sentiment analysis, especially in social media monitoring or market research, accurately identifying sentiments (precision) is crucial, but so is capturing the full breadth of relevant data (recall).

Companies rely on high F1 scores to ensure their sentiment analysis tools effectively guide marketing and strategy decisions. For example, Aylien, a text analytics API provider, used sentiment analysis for projects like predicting Oscar winners and analyzing Brexit sentiment. Similarly, KPMG used sentiment analysis for evaluating sustainability reports, requiring a detailed understanding of positive and negative sentiments. McKinsey’s City Voices project also employed sentiment analysis to capture and analyze citizens’ sentiments on various aspects of city life. In these applications, the F1 score’s ability to balance precision and recall makes it a vital tool for evaluating the effectiveness of sentiment analysis models.

These are only a few examples of useful applications where both aspects of a model’s performance – precision and recall – are critically important and where their balance has significant practical implications. Understanding the specific requirements and trade-offs in each application is key to effectively employing the F1 score as a measure of model performance.

Conclusion

In real-world scenarios, data is often imbalanced, which makes training accurate models challenging. While there are techniques like SMOTE to address this, verifying their effectiveness is essential. This is where the F1 score becomes indispensable: because it considers both recall and precision, it offers a more reliable measure of model performance in imbalanced situations. Understanding the F1 score, how to calculate it, and where it applies is therefore an important practical skill.

To effectively navigate the challenges of imbalanced data and optimize your models, it’s important to understand and apply the F1 score. Dive deeper into this essential metric, explore model-building and evaluation techniques, and learn how to accurately assess and improve your models in real-world scenarios. Take action now to master these skills and improve your ML expertise.

