What is an F1 score?

published on 27 September 2024

Artificial intelligence has become integrated into many facets of our everyday lives, from virtual assistants and personalized recommendations to healthcare diagnostics and fraud detection; we are far more likely to interact with AI-powered software or tools than we were just a couple of years ago. While this is a positive development, it raises an important question: how can we trust the predictions or outputs of AI-powered solutions? This concern is especially important in situations where inaccurate predictions could result in significant losses, whether financial or in terms of human life.

In this article, we will explore the F1 score as a performance metric for evaluating the effectiveness of classification models: how it is calculated and why it is often preferred over other metrics such as precision or recall alone.

Understanding the importance of F1 score in classification models

Classification models are algorithms that analyze and categorize complex data sets into predefined classes or labels. These models are used for tasks across various sectors, such as anomaly detection, medical diagnosis, text classification, and more. For example, in anomaly detection, classification models help label data points as either "anomalous" or "non-anomalous".

Similarly, in medical diagnosis, a classification model might be used to detect cancer by categorizing patient data into "cancerous" or "non-cancerous" groups.

In such examples, “false positives” and “false negatives” in classification models can have serious consequences. So how can we trust the predictions of these models? The F1 score offers one way to evaluate how well a classification model recognizes and categorizes data into different subsets. To fully understand the F1 score, let's explore three important concepts:

  • The possible outcomes of a classification model
  • The precision and recall performance metrics
  • How precision and recall combine into the F1 score to give a more comprehensive assessment of a model’s performance

Now that we've outlined the significance of classification models, it's important to take a closer look at their prediction outcomes. These outcomes form the foundation for performance metrics such as precision, recall, and, more importantly, the F1 score.

Understanding the possible outcomes of a classification model

A classification model prediction typically falls into one of these four categories:

  • True Positives: These are events or data points that were correctly predicted as positive.
  • True Negatives: These are events that were correctly predicted as negative.
  • False Positives: These are events that were incorrectly predicted as positive but were actually negative.
  • False Negatives: These are events that were incorrectly predicted as negative but were actually positive.


These four outcomes form the basis of precision and recall, which together make up the F1 score.
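To make these outcomes concrete, here is a minimal Python sketch that counts the four outcomes for a binary classifier; the actual and predicted labels below are made up purely for illustration.

```
# Counting the four prediction outcomes for a binary classifier.
# The actual and predicted labels below are made up purely for illustration.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

true_positives  = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
true_negatives  = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
false_positives = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
false_negatives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

print(true_positives, true_negatives, false_positives, false_negatives)  # 3 3 1 1
```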

What are precision and recall?

Now that we understand these four outcomes, let's use them to explain precision and recall.

Precision
The precision performance metric determines the quality of positive predictions by measuring their correctness. In other words, it measures how many of the positive predictions made by the model were actually correct. Precision is calculated by dividing the number of true positive outcomes by the sum of the true positives and false positives.

Precision = True Positives / (True Positives + False Positives)

Example:

To better understand precision, let’s consider a pool of 500 emails and a spam filter that has been employed to identify which of these emails are spam.

Suppose the filter identifies 120 emails as spam, but only 100 of those emails are actually spam. In this case, the precision of the spam filter would be:

Precision = 100 / (100 + 20) = 0.833

This means that 83.3% of the emails that the filter identified as spam were actually spam.
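To make the arithmetic concrete, here is a short Python sketch of this calculation, reusing the numbers from the spam filter example:

```
# Precision for the spam filter example: 120 emails flagged as spam, 100 of which are truly spam.
true_positives = 100   # flagged emails that really are spam
false_positives = 20   # flagged emails that are not spam

precision = true_positives / (true_positives + false_positives)
print(round(precision, 3))  # 0.833
```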

While precision focuses on the accuracy of positive predictions, recall assesses the model's overall ability to identify all actual positive cases.

Recall
Recall, also known as sensitivity, measures a model’s ability to accurately detect positive events. In simpler terms, it indicates how many of the actual positive instances were correctly identified by the model. Recall can be calculated using the formula below:

Recall = True Positives / (True Positives + False Negatives)

Example:

Let’s return to the spam email filter example. We saw that of the 120 emails the filter flagged as spam, 100 were indeed spam. However, what if there were actually 200 spam emails in total? In this scenario, the recall would be:

Recall = 100 / 200 = 0.5

This means that the filter correctly identified 50% of all actual spam emails. 
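Here is the corresponding Python sketch for recall, again using the spam filter numbers (200 actual spam emails, of which 100 were caught):

```
# Recall for the spam filter example: 200 actual spam emails, 100 of which were caught.
true_positives = 100    # spam emails the filter caught
false_negatives = 100   # spam emails the filter missed (200 actual - 100 caught)

recall = true_positives / (true_positives + false_negatives)
print(recall)  # 0.5
```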

While precision and recall provide valuable insights into a model’s performance, relying solely on one without considering the other can give an incomplete picture.

Limitations of using precision as a classification metric without recall (and vice versa)

Considering precision without recall, and vice versa, can lead to a misleading evaluation of a model's performance, especially in scenarios where class distribution is imbalanced or where different types of errors (false positives vs. false negatives) have varying consequences.

Limitations of Precision without Recall

Precision alone focuses solely on the correctness of the positive predictions, ignoring how well the model captures all possible positives. A model with very high precision might seem impressive, but if it misses a large number of actual positive instances (low recall), it could be underperforming. This often occurs in cases where a model is extremely cautious about making positive predictions, leading to fewer but more accurate positive results. This cautious approach minimizes false positives but increases false negatives.

For example, imagine a medical diagnosis model designed to detect a rare disease. If the model has perfect precision but low recall, every case it flags as having the disease truly has it. However, if it flags only 2 out of 50 actual positive cases, its recall is very low. This means that while every diagnosed patient truly has the disease (precision is 100%), the model is missing the vast majority of patients who actually have it, making it unreliable for early diagnosis and treatment.
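In this hypothetical example, the recall would be 2 / 50 = 0.04, meaning only 4% of the patients who actually have the disease are identified.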

Limitations of Recall without Precision

Similarly, focusing on recall alone means you're only considering how many true positives are identified out of the total actual positives without regard to how many false positives the model produces. A high recall could indicate the model captures most positive instances, but it might be over-predicting positives, leading to a flood of false positives and reduced accuracy in actual predictions.

Using the medical diagnosis example, imagine a medical diagnosis model with 100% recall that flags every patient as having the disease to ensure it never misses a single case. While the recall is perfect, the precision is incredibly low because many healthy individuals will be wrongly diagnosed. This makes the model impractical, as it would result in unnecessary anxiety and treatments for people who do not actually have the disease.

This highlights the importance of a comprehensive metric combining precision and recall—the F1 score.

What is an F1 score?

An F1 score can be understood as the harmonic mean of precision and recall, combining both these metrics into one comprehensive assessment that neither performance metric can offer alone.

The F1 score uses the harmonic mean of precision and recall, rather than their arithmetic mean, for two important reasons:

  • The F1 score gives both metrics equal weight, ensuring that a good F1 score signifies that the model strikes a good balance between precision and recall.
  • Unlike the arithmetic mean, the harmonic mean prevents a high precision score from masking a low recall in the overall F1 score, and vice versa, as the short comparison below illustrates.
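
To illustrate with made-up numbers, suppose a model has a precision of 1.0 but a recall of only 0.1. The arithmetic mean would rate it at a middling 0.55, while the harmonic mean reflects the poor recall:

```
# Illustrative comparison with made-up numbers: high precision, poor recall.
precision, recall = 1.0, 0.1

arithmetic_mean = (precision + recall) / 2
harmonic_mean = 2 * (precision * recall) / (precision + recall)

print(round(arithmetic_mean, 3))  # 0.55  -- hides the poor recall
print(round(harmonic_mean, 3))    # 0.182 -- reflects the poor recall
```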

The F1 score can be calculated as:

F1 score = 2 * (precision * recall) / (precision + recall)

So, returning to the spam filter example, with a precision of 0.8333 and a recall of 0.5, the F1 score would be:

F1 score = 2 * (0.8333 * 0.5) / (0.8333 + 0.5)
F1 score = 0.625

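For completeness, here is a short Python sketch of this F1 calculation, reusing the precision and recall values from the spam filter example:

```
# F1 score for the spam filter example.
precision = 100 / (100 + 20)  # ≈ 0.8333
recall = 100 / 200            # 0.5

f1_score = 2 * (precision * recall) / (precision + recall)
print(round(f1_score, 3))  # 0.625
```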

Calculating a model's F1 score can provide a clearer, more balanced measure of its performance, especially in cases where both precision and recall are critical. 

Interpreting the F1 score


Similar to most performance metrics, the F1 score ranges from 0 to 1, with 0 representing the worst possible score and 1 representing the best possible score a model can get. 

A high F1 score indicates that the model has good precision and recall, showing a well-balanced performance. Conversely, a low F1 score may suggest a trade-off between precision and recall or indicate that the model performs poorly on both metrics.

This comprehensive insight provided by the F1 score is particularly crucial in anomaly detection, as it helps evaluate the model's ability to accurately recognize and identify anomalous events.

F1 score in anomaly detection

Anomaly detection, once a labor-intensive process, has become much more efficient with the rise of artificial intelligence. Advanced tools like Eyer, an AI-powered anomaly detection platform, have streamlined this process by automating the identification of unusual data patterns.

At its core, anomaly detection involves analyzing data to identify patterns or behaviors that deviate significantly from the norm. These deviations, often referred to as anomalies or outliers, can signal critical events such as fraud, system failures, or network intrusions. By using Eyer's sophisticated algorithms, these anomalies can be detected earlier and with greater accuracy, enabling organizations to respond to potential threats in real-time.

Given the potential consequences of relying on ineffective anomaly detection tools, it’s crucial to trust the performance of platforms like Eyer. One way to measure this trust is through the F1 score, which provides valuable insights into the balance between precision and recall.

For a deeper dive into Eyer's performance, including its F1 score testing results, check out the official documentation and read about the F1 performance testing of Eyer’s core algorithm.

In summary

Many of the artificial intelligence models we encounter in our daily lives are classification models. These models help us determine whether data has specific characteristics, ranging from something as simple as identifying spam emails to more critical applications like diagnosing cancer in patients.

Since we often don’t know the correct answers to the questions posed by classification models, it’s essential to trust these systems to make accurate predictions and draw the right conclusions from the data. This is where the F1 score comes into play.

The F1 score offers a balanced evaluation of a classification model’s performance by considering both precision and recall. Its value lies in providing a comprehensive measure that neither precision nor recall can fully capture. This makes the F1 score particularly vital in high-stakes scenarios like anomaly detection and medical diagnosis, where both false positives and false negatives can have serious consequences. By understanding and calculating the F1 score, we gain deeper insights into the effectiveness of AI-powered classification models, allowing us to develop more reliable and trustworthy systems. Tools like Eyer, which incorporate the F1 score into their evaluations, demonstrate how this metric can enhance decision-making in real-world AI applications.

Ultimately, using the F1 score not only helps validate the performance of these models but also ensures that they align with the critical needs of various sectors. Whether in healthcare, finance, or cybersecurity, understanding the strengths and weaknesses of classification models through the F1 score can lead to better outcomes and increased confidence in automated decisions. As reliance on AI grows, prioritizing robust evaluation metrics like the F1 score will be essential for building the next generation of intelligent systems that we can trust.

Lastly, check out Eyer for an F1 score-approved, AI-powered anomaly detection tool to monitor your systems.
