If you like what we're working on, please  star us on GitHub. This enables us to continue to give back to the community.
DEEPCHECKS GLOSSARY

F-score

What is an F-score?

The F-score (also known as the F1 score or F-measure) is a metric used to evaluate the performance of a Machine Learning model. It combines precision and recall into a single score.

F-measure formula:

  • F-score = 2 * (precision * recall) / (precision + recall)

Accuracy in making positive predictions is measured by a recall, while identifying all positive occurrences in the data is quantified by precision. The F-score ranges from 0 to 1, with higher values indicating better performance.

The F-score in Machine Learning is often used when the goal is to balance precision with recall and is particularly useful when the positive class is rare. In a medical diagnosis system, it might be more important to ensure that the model has high recall (to minimize the risk of missed diagnoses), while in a spam filter, ensuring high precision is a top priority (to minimize the number of false positives).

In some cases, it may be necessary to weigh the precision and recall differently, in which case the F-score can be modified to be the weighted harmonic mean of precision and recall. This is known as the F-beta score, where beta is the weight applied to the recall.

F-2 score

An F-2 score is a variant of the F1 measurement, a metric used to evaluate the performance of a Machine Learning model. Like the F-score, the F-2 score combines precision and recall into a single score. However, the F-2 score places more emphasis on recall than the standard F-1 score.

The F-2 score is defined as:

  • F-2 score = (1 + 2^2) * (precision * recall) / (2^2 * precision + recall)

The F-2 score ranges from 0 to 1, with higher values indicating better performance.

This score is often used when the goal is to prioritize recall over precision. Just as we mentioned in our medical diagnosis example, it might be more important to ensure that the model has high recall (to minimize the risk of missed diagnoses). The F-2 score can be used to reflect this emphasis.

It is worth noting that the F-2 score is just one way to weigh the precision and recall components of the F-score. Other common weights include the F-0.5 score, which places more emphasis on precision.

Testing. CI/CD. Monitoring.

Because ML systems are more fragile than you think. All based on our open-source core.

Deepchecks HubOur GithubOpen Source

Applications of F-score

  • Classification tasks. The F-score is often used to evaluate the performance of a classifier, particularly when the goal is to balance precision and recall.
  • Information retrieval tasks. The F-score can be used to evaluate the performance of a search engine or other information retrieval system. Precision quantifies how well-retrieved documents match the query, whereas recall evaluates how many appropriate results were found.
  • Hyperparameter optimization. The F-score can be used as a performance metric when optimizing the hyperparameters of a Machine Learning model. This can be useful in finding the best set of hyperparameters for a given task.
  • Model comparison. The F-score can be used to compare the performance of different Machine Learning models on the same task. This can be useful when choosing the best model for a particular application.

It is worth noting that the F-score is just one metric that can be used to evaluate the performance of a Machine Learning model. Other common metrics include accuracy, Area Under the Curve (AUC), and log loss. The appropriate metric will depend on the specifics of the task and the goals of the model.