How to Check the Accuracy of Your Machine Learning Model

Introduction

Accuracy is one of the most popular metrics for evaluating classification models, largely because it is simple to understand and implement. For simple cases, it is a valid measure of model performance. Unfortunately, real-world scenarios are rarely simple: you often face imbalanced datasets, multiclass problems, or multilabel classification challenges. In these complex scenarios, high accuracy does not necessarily indicate good performance. As machine learning problems get more complex, calculating and interpreting accuracy also becomes more difficult and requires special attention.

It is therefore important to know what accuracy measures, how to calculate it, and which caveats apply when it is used in different machine learning scenarios. This article gives an in-depth look at accuracy as a classification metric. More specifically, it defines accuracy, shows how to use it in binary, multiclass, and multilabel settings, and discusses its main pitfalls. To deepen your understanding, you will also find practical examples and the code behind them here.

Accuracy

As one of the fundamental metrics for classification problems, accuracy measures the proportion of correct predictions made by the model. It is calculated as the number of correct predictions divided by the total number of predictions. The accuracy formula in machine learning is as follows:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

This is a very simple formula, and it leads to an easily understandable definition of accuracy when the classification problem involves only two classes. Accuracy is an intuitive metric and easy to compute, but the simple formulation above assumes a binary classification context. Now, let’s see how to apply accuracy to multiclass and multilabel classification and discuss the details of these more complex cases.

Example implementation:

from sklearn.metrics import accuracy_score

# Example data
y_true = [0, 1, 0, 1, 0, 1, 1, 0] 
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

# Calculate accuracy score
accuracy = accuracy_score(y_true, y_pred)

# Print the accuracy score
print("Accuracy Score:", accuracy)

Output: Accuracy Score: 0.75
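
As a quick sanity check (this snippet is not part of the original example), the same value can be computed by hand directly from the formula above:

# Manual computation: correct predictions divided by total predictions
correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = correct / len(y_true)
print("Manual accuracy:", manual_accuracy)  # 0.75, matching accuracy_score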

The Accuracy Paradox

Default accuracy is an overall measure of a model's performance on the entire dataset. But this overall accuracy can be misleading, especially when the class distribution is imbalanced and correctly predicting the minority class is important. In such cases, the model can achieve a high accuracy score by correctly predicting the majority class while consistently misclassifying the minority class, giving a false impression of good performance. Take, for instance, a cancer prediction model where it is essential to identify whether a patient’s sample is malignant. Misclassifying a malignant sample as benign delays treatment, with potentially fatal consequences, while misclassifying a benign sample as malignant inflicts unnecessary treatment on a healthy individual and erodes trust in the diagnostic process. And such datasets are generally highly imbalanced, with far more benign cases than malignant ones.

Let’s illustrate this with an example using the Wisconsin Breast Cancer dataset, which classifies breast tumor cases as benign or malignant.

By making the dataset imbalanced, removing most malignant cases so that only about 5.6% of the samples are malignant, we challenge the model’s performance.

# cancer_data and random_seed are defined earlier in the article's full code.
# Keep only 10% of the malignant cases (and all benign cases) to create an imbalanced dataset.
cancer_data_imbalanced = pd.concat(
    [
        cancer_data[cancer_data["labels"] == "malignant"].sample(
            frac=0.1, random_state=random_seed
        ),
        cancer_data[cancer_data["labels"] == "benign"],
    ]
)
# Keep a single feature ("mean texture") plus the labels.
cancer_data_imbalanced = cancer_data_imbalanced.loc[:, ["mean texture", "labels"]]
# Show the resulting class proportions.
cancer_data_imbalanced["labels"].value_counts(normalize=True)

We also use only a single feature, 'mean texture', to make the model’s job harder.

Let’s see how well we can predict malignancy in this setting.

# get_prediction_results is a helper from the article's accompanying code; it fits the
# model and returns a per-sample results table with a boolean "Prediction success" column.
model = DecisionTreeClassifier(random_state=random_seed)
prediction_results = get_prediction_results(X_train, y_train, y_test, model)

# Overall accuracy: correct predictions divided by the total number of test samples.
prediction_results["Prediction success"].sum() / prediction_results[
    "Prediction success"
].count()
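
The get_prediction_results helper and the train/test split come from the article's accompanying code, which is not shown here. A minimal sketch of what such a helper might look like (an assumption for illustration, not the article's exact implementation) is:

import numpy as np
import pandas as pd

def get_prediction_results(X_train, y_train, y_test, model):
    # Sketch of the assumed helper: fit the model, predict on the test set,
    # and return a per-sample results table. X_test is assumed to be available
    # in the surrounding (notebook) scope, matching how the helper is called here.
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    y_true = np.asarray(y_test)
    return pd.DataFrame(
        {
            "Actual": y_true,
            "Predicted": predictions,
            "Prediction success": predictions == y_true,  # True when the prediction is correct
        }
    )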

Our model achieves an overall accuracy of approximately 94.64%, which initially seems impressive. However, a closer look at the class-level predictions using a confusion matrix reveals a different story: the model misdiagnoses almost all malignant cases. This result starkly contrasts with the high overall accuracy, demonstrating the accuracy paradox.

[Image: confusion matrix for the imbalanced breast cancer model]

The high accuracy is an illusion on this imbalanced dataset, where misclassifying the minority class is very costly. Similar situations occur when predicting rare but critical events, such as serious medical conditions, economic crises, terrorist attacks, or meteor impacts. In such cases, an accuracy score of 90% can be meaningless, because even a single missed case might lead to catastrophic results. Relying on accuracy alone is therefore not enough and can be misleading.

To avoid this problem, ask yourself the following:

  • Is my data imbalanced?
  • How much does it cost to misclassify each class?

When accuracy doesn’t cut it as a good evaluation metric for your ML model, consider the following alternatives:

Metric | Description
--- | ---
Precision | How many of the predicted positives are actually positive. Prioritize this when false positives are costly.
Recall (Sensitivity) | How many of the actual positives are correctly identified. Prioritize this when missing positives is costly.
F1 Score | A balanced metric combining precision and recall. Use it when you need a single number to summarize performance.
Confusion Matrix | A detailed table of true/false positives and negatives. Helps pinpoint where your model is making mistakes.
ROC Curve & AUC | Visualizes the trade-off between the true positive rate and false positive rate at various thresholds. A higher AUC means a better model overall.
PR Curve | Similar to ROC, but focuses on the trade-off between precision and recall. Helpful for imbalanced datasets.
Matthews Correlation Coefficient | A comprehensive metric that accounts for true/false positives and negatives, even with imbalanced classes. Ranges from -1 (worst) to +1 (best).

Table: Alternative metrics to accuracy
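
To illustrate how these alternatives are computed (this snippet is not from the original article), here is how several of them can be obtained with scikit-learn on the small binary example from earlier:

from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    roc_auc_score,
    matthews_corrcoef,
)

y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:", recall_score(y_true, y_pred))  # TP / (TP + FN)
print("F1 Score:", f1_score(y_true, y_pred))  # harmonic mean of precision and recall
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
# ROC AUC is normally computed from probability scores; hard labels are used here only for illustration.
print("ROC AUC:", roc_auc_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))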

Accuracy in Multiclass Problems

In multiclass classification problems, accuracy is defined similarly to binary classification, but the calculation must account for multiple classes rather than just two. Here is the generalized formula for accuracy in multiclass problems:

$$\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} [[y_i = z_i]]$$

Where

  • N is the number of samples.
  • [[…]] is the Iverson bracket, which returns 1 when the expression within it is true and 0 otherwise.
  • y_i and z_i are the true and predicted output labels of the given sample, respectively.
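
As an illustrative sketch (not from the original article), this formula translates directly into a few lines of NumPy:

import numpy as np

y_true = np.array([0, 2, 1, 1, 0, 2, 1])  # true labels for a 3-class problem (made-up data)
y_pred = np.array([0, 1, 1, 2, 0, 2, 1])  # predicted labels

# Iverson bracket: 1 where the prediction matches the true label, 0 otherwise
correct = (y_true == y_pred).astype(int)

# Average over the N samples
accuracy = correct.sum() / len(y_true)
print(accuracy)  # 5 correct out of 7, approximately 0.714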

Let’s see an example. The following confusion matrix shows true values and predictions for a 3-class prediction problem.

[Image: confusion matrix for a 3-class prediction problem]

We calculate accuracy by dividing the number of correct predictions (the sum of the matrix’s diagonal) by the total number of samples.

The result tells us that our model achieved a 44% accuracy on this multiclass problem.
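
In code, this calculation amounts to dividing the trace of the confusion matrix by its total. The matrix below is illustrative only, not the one shown in the figure:

import numpy as np

# Illustrative 3-class confusion matrix (rows: true class, columns: predicted class)
cm = np.array([
    [5, 2, 3],
    [3, 4, 3],
    [2, 1, 2],
])

# Correct predictions sit on the diagonal; divide by the total number of samples
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 11 / 25 = 0.44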

However, calculating an overall accuracy metric also conceals class-level issues in the multiclass case, so it is important to examine class-level predictions.

For example, let’s make predictions on the Iris dataset by using the sepal columns.

iris_data_sepal = iris_data.loc[:, ["sepal width (cm)", "sepal length (cm)", "labels"]]
iris_data_sepal.sample(5, random_state=random_seed)

model = DecisionTreeClassifier(random_state=random_seed)
prediction_results = get_prediction_results(X_train, y_train, y_test, model)
prediction_results["Prediction success"].mean()

The overall accuracy is ~76.7%, which might not be that bad.

However, when we examine the results at the class level, the picture is more varied.

Accuracy is hard to interpret for individual classes in a multiclass problem, so we use the class-level recall values instead.

[Image: class-level confusion matrix for the Iris predictions]

The confusion matrix shows that we correctly predicted all the ‘setosa’ types but had only 75% success with the ‘versicolor’ and 50% with the ‘virginica’ ones.
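
One way to obtain such class-level numbers (an illustrative snippet, not the article's exact code; y_test and predictions are assumed to hold the true and predicted Iris labels for the test set) is scikit-learn's per-class recall or the full classification report:

from sklearn.metrics import classification_report, recall_score

# Per-class recall: the fraction of each true class that was predicted correctly
print(recall_score(y_test, predictions, average=None))

# A full per-class breakdown of precision, recall, F1, and support
print(classification_report(y_test, predictions))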

[Hyperlink to the article on Understanding Classification Metrics: Accuracy, Precision, Recall, F1 Score, ROC-AUC, and PR-AUC for Binary and MultiClass Models]

Accuracy in Multilabel Problems

Multilabel classification differs from multiclass classification because, in multilabel, an instance can belong to multiple classes simultaneously, while in multiclass, each instance belongs to only one class. These problems can be viewed as multiple binary classification problems, one for each class.
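
For intuition (an illustrative snippet, not from the original article), multilabel targets are usually represented as a binary indicator matrix with one row per sample and one column per label:

import numpy as np

# A 1 means the label applies to that sample.
# Sample 0 has labels {0, 2}, sample 1 has label {1}, sample 2 has labels {0, 1, 2}.
y_multilabel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1],
])
print(y_multilabel.shape)  # (3 samples, 3 labels)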

Let’s see an example based on the RCV1 data set. In this problem, we try to predict 103 classes represented as a big sparse matrix of output labels. To simplify our task, we use a 1000-row sample.

The model seems to be accurate when we compare predictions with test values.

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Fetch the RCV1 dataset (sparse document features with 103 target labels)
rcv1 = datasets.fetch_rcv1()
rcv1_data, rcv1_target, sample_id, target_names = (
    rcv1["data"],
    rcv1["target"],
    rcv1["sample_id"],
    rcv1["target_names"],
)

# Take a random 1,000-row sample to keep the example fast
samples = np.random.randint(0, rcv1_data.shape[0], 1000)
rcv1_data_sample = rcv1_data[samples]
rcv1_target_sample = rcv1_target[samples]
rcv1_data_sample.shape, rcv1_target_sample.shape

# Convert the sparse matrices to dense arrays and split into train and test sets
# (random_seed is defined earlier in the article's full code)
X_train, X_test, y_train, y_test = train_test_split(
    rcv1_data_sample.toarray(),
    rcv1_target_sample.toarray(),
    train_size=0.8,
    random_state=random_seed,
)

# Fit a decision tree and predict the full label matrix for the test set
model = DecisionTreeClassifier(random_state=random_seed)
predictions = model.fit(X_train, y_train).predict(X_test)
predictions
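
The figure from the original article (omitted here) showed a high level of agreement between the predictions and the test values. As an illustrative check (not the article's exact code), a naive element-wise comparison gives a similarly flattering number:

# Fraction of individual label entries that match; this is dominated by the
# huge number of zeros ("Negative" labels) in the label matrix.
print((predictions == y_test).mean())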

However, this is not a meaningful result, because it relies on the huge number of ‘Negative’ entries in the label vectors. The problem is similar to the imbalanced binary case, except that now we have many imbalanced label vectors, each dominated by ‘Negative’ values.

Therefore, to get a more meaningful picture of the model’s performance, we need metrics designed for the multilabel setting.

Multilabel Accuracy or Hamming Score

The Hamming Score is a multilabel metric that, for each sample, compares the number of correctly predicted labels (labels active in both the true and the predicted label sets) with the total number of labels active in either set, and then averages this ratio over all samples.

$$\text{Hamming Score} = \frac{1}{N}\sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}$$

Where

  • N is the number of samples.
  • Y_i and Z_i are the true and predicted output label sets for the given sample, respectively.

This multilabel accuracy is more balanced because it does not depend on an ‘exact match’ criterion, as subset accuracy does, and it does not naively count ‘True Negative’ values as correct. The better the model, the closer the Hamming Score is to 1.

def hamming_score(y_test, predictions):
    # Per-sample ratio of correctly predicted labels (intersection) to all labels
    # active in either the true or the predicted set (union), averaged over samples
    return (
        (y_test & predictions).sum(axis=1) / (y_test | predictions).sum(axis=1)
    ).sum() / predictions.shape[0]

hamming_score(y_test, predictions)
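
This per-sample intersection-over-union is the same quantity that scikit-learn computes as the sample-averaged Jaccard score, so the built-in should give a matching value on the same inputs (up to edge cases where a sample has no true or predicted labels):

from sklearn.metrics import jaccard_score

# Sample-averaged Jaccard index, equivalent to the Hamming Score defined above
print(jaccard_score(y_test, predictions, average="samples"))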

Hamming Loss

Hamming Loss measures the fraction of incorrectly predicted labels out of the total number of label assignments, accounting for both false positives and false negatives. It ranges from 0 to 1, with 0 indicating no errors. The formula for Hamming Loss is:

$$\text{Hamming Loss} = \frac{1}{N \cdot k}\sum_{i=1}^{N} |Y_i \,\Delta\, Z_i|$$

Where

  • N is the number of samples.
  • k is the number of labels.
  • Y_i and Z_i are the given sample’s true and predicted output label sets, respectively.
  • Δ denotes the symmetric difference between the two sets (the labels present in exactly one of them).

The main advantage of Hamming Loss is its simplicity. A lower Hamming Loss indicates better model performance.

def hamming_loss(y_test, predictions):
    # Fraction of individual label entries that differ between truth and prediction
    return (y_test != predictions).sum().sum() / y_test.size

hamming_loss(y_test, predictions)
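
scikit-learn also ships a built-in sklearn.metrics.hamming_loss, which should match the manual implementation above on the same inputs:

from sklearn.metrics import hamming_loss as sk_hamming_loss

# Built-in Hamming Loss for comparison with the manual version (aliased to avoid a name clash)
print(sk_hamming_loss(y_test, predictions))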

Beyond Hamming Score and Hamming Loss, you can use multilabel versions of the standard classification metrics from the binary and multiclass cases. For further exploration of multilabel metrics, see [Hyperlink to the article on Understanding Classification Metrics: Accuracy, Precision, Recall, F1 Score, ROC-AUC, and PR-AUC for Binary and MultiClass Models]. Such metrics and techniques give you a more realistic and accurate picture of a model’s performance on multilabel classification problems and thereby overcome the limitations of relying on a single accuracy number.

Subset Accuracy or Exact Match Ratio

Subset Accuracy is also known as the Exact Match Ratio or Labelset Accuracy. It is a stricter form of the accuracy metric: for a prediction to count as correct, all of the sample’s labels have to match exactly. Its formula is as follows:

$$\text{Subset Accuracy} = \frac{1}{N}\sum_{i=1}^{N} [[Y_i = Z_i]]$$

Where

  • N is the number of samples.
  • [[…]] is the Iverson bracket, which returns 1 when the expression within it is true and 0 otherwise.
  • Y_i and Z_i are the given sample’s true and predicted output label sets, respectively. (Please note that we compare full label sets here, not single labels.)

Since this metric compares entire label sets rather than individual labels, achieving a high Subset Accuracy can be particularly challenging, especially with a large number of labels. Consequently, Subset Accuracy scores are often very low for models dealing with many labels.

def exact_match(y_test, predictions):
    # A sample counts as correct only if every one of its labels matches exactly
    return (y_test == predictions).all(axis=1).mean()

exact_match(y_test, predictions)
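
For multilabel indicator matrices, scikit-learn's accuracy_score computes exactly this subset accuracy, so it can serve as a cross-check:

from sklearn.metrics import accuracy_score

# For multilabel indicator inputs, accuracy_score is the exact match ratio
print(accuracy_score(y_test, predictions))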

This strict criterion does not account for partial correctness. A model that predicts almost all of the labels correctly but misses only one is still counted as a failure by Subset Accuracy. This limitation makes Subset Accuracy less informative in practice, where partial correctness is valuable. The obvious alternative in this case is to use metrics that reward partial correctness, such as the following:

  • Hamming Score: Measures the per-sample ratio of correctly predicted labels to all active labels, giving a well-balanced view of performance.
  • Hamming Loss: Returns the fraction of labels that were incorrectly predicted and gives you better insight into false positives and false negatives.
  • Precision, Recall, and F1 Score: Apply to multilabel problems just as they do to single-label tasks and can be used to estimate the performance of every label individually while still providing a comprehensive overview.

With such alternative metrics, you are likely to get a better understanding of your model’s performance in situations where exact matches are rare but partial correctness is important.
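
To make the difference concrete (an illustrative example, not from the original article), consider a prediction that gets almost every label right but misses a single one:

import numpy as np

# One sample with five labels; the prediction misses one active label
y_true = np.array([[1, 1, 0, 1, 0]])
y_pred = np.array([[1, 1, 0, 0, 0]])

subset_accuracy = (y_true == y_pred).all(axis=1).mean()
hamming = ((y_true & y_pred).sum(axis=1) / (y_true | y_pred).sum(axis=1)).mean()

print(subset_accuracy)  # 0.0: counted as a complete failure
print(hamming)          # about 0.67: credit for the labels it got right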

Further Accuracy Types

We have covered the most important applications of accuracy in binary, multiclass, and multilabel problems. Naturally, depending on your concrete problem, further accuracy variants can be useful. Here are some of the more common ones:

  • Balanced Accuracy: Available for both binary and multiclass classification. It is intended for imbalanced data, where one of the target classes is much more frequent than the others, and is calculated as the arithmetic mean of sensitivity and specificity in the binary case (more generally, the macro-average of per-class recall).
from sklearn.metrics import balanced_accuracy_score

y_true = [0, 1, 0, 1, 0, 1, 0, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

balanced_accuracy = balanced_accuracy_score(y_true, y_pred)
print(f"Balanced Accuracy: {balanced_accuracy}")

Output: Balanced Accuracy: 0.5

  • Top-K Accuracy: Used when the model outputs a probability (or score) for each class and a prediction counts as correct if the true label is among the K highest-scoring classes. This metric is especially useful in recommendation systems and image classification, where more than one plausible prediction can be acceptable.
from sklearn.metrics import top_k_accuracy_score

y_true = [0, 1, 2, 2, 1]
y_pred_proba = [
    [0.2, 0.3, 0.5],
    [0.1, 0.6, 0.3],
    [0.3, 0.2, 0.5],
    [0.4, 0.4, 0.2],
    [0.1, 0.7, 0.2]
]
k = 2

top_k_acc = top_k_accuracy_score(y_true, y_pred_proba, k=k)
print(f"Top-{k} Accuracy: {top_k_acc}")

Output: Top-2 Accuracy: 0.6

  • Accuracy of Probability Predictions: Measures how close the predicted probabilities are to the actual outcomes, rather than just whether the final class labels are correct. Typical metrics for evaluating probability predictions are the Logarithmic Loss (Log Loss) and the Brier Score; for both, lower values are better.
from sklearn.metrics import log_loss, brier_score_loss
y_true = [0, 1, 1, 0, 1]
y_pred_proba = [0.2, 0.6, 0.3, 0.4, 0.7]
log_loss_value = log_loss(y_true, y_pred_proba)
print(f"Log Loss: {log_loss_value}")
brier_score = brier_score_loss(y_true, y_pred_proba)
print(f"Brier Score: {brier_score}")

When to use Accuracy Score in ML

The accuracy score should be used when you want to know how well a model classifies data points overall, irrespective of its performance on individual classes or labels. It gives you a quick intuition about whether your model and data are a good fit for the classification task.

If you need to use the accuracy metric in your project, there are simple-to-use packages like Deepchecks that give you in-depth reports on the relevant metrics for evaluating your model, which makes it easier to understand your model’s performance.

Be Sure You Know How to Measure the Accuracy of Your ML Model

Whatever metric you choose, you should know what it is good for, what its caveats are, and how to guard against its common pitfalls. The bigger your ML projects, the more complex the system of metrics you need to monitor. You have to learn about these metrics, know how to implement them, and keep them in check continuously. These tasks can become hard to maintain and tend to introduce wrong metrics, measurements, and interpretations.

[Image: Deepchecks model evaluation, validation, and monitoring suite]

One way to make model evaluation, validation, and monitoring easier is to use ML tools like Deepchecks at the different stages of the ML lifecycle. It provides a broad range of tried and tested metrics with ready-made implementations and detailed documentation, so you can better understand how your ML models work and trust them more.

Final notes

Accuracy is not a universally applicable metric for machine learning models, though it can be quite a useful one. Imbalanced datasets, multiclass problems, and multilabel scenarios can all severely limit what it tells you. For these reasons, a thorough assessment of your model’s performance should recognize these limitations and consider alternative measures such as precision, recall, F1 score, and others, depending on the particular situation and the costs associated with different errors.

Additionally, Deepchecks can streamline the model evaluation, validation, and monitoring process. It offers a wide range of verified metrics and detailed documentation to ensure a deeper understanding of, and greater trust in, your ML models. Are you interested in how? Get started. By carefully considering the appropriate metrics and utilizing the available tools, you can make informed decisions and optimize your ML models for better real-world performance.
