Neural networks and out-of-distribution data

A crucial requirement for deploying a classifier in many real-world machine learning applications is the ability to detect, statistically or adversarially, test samples drawn sufficiently far from the training distribution. Deep neural networks (DNNs) have achieved high accuracy on many classification tasks, such as speech recognition, object detection, and image classification. However, estimating their predictive uncertainty remains difficult. Well-calibrated predictive uncertainty is essential because it can be used in a wide variety of machine learning applications.

Neural networks employing the softmax classifier, however, are known to yield significantly overconfident predictions. While such overconfidence is acceptable in error-tolerant applications such as product recommendation, it is risky in intolerant fields such as robotics or medicine, where it can lead to fatal accidents. Where possible, an effective AI system should generalize to OOD cases, flag those beyond its capacity, and request human intervention.

Neural network models can rely heavily on spurious cues and annotation artifacts inherent in the training data, and OOD examples are unlikely to exhibit the same spurious patterns as in-distribution examples.

Because the training data cannot cover all aspects of a distribution, the model’s capacity to generalize is limited.

Out-of-Distribution (OOD)

The term “distribution” has slightly different meanings for language and vision tasks. Consider the task of classifying photographs of cat breeds: photographs of cats are in-distribution, while photographs of dogs, humans, balls, and other objects are out-of-distribution.

In real-world tasks, the data distribution generally drifts over time, and tracking an evolving data distribution is expensive.

  • OOD detection is critical to preventing AI systems from making incorrect predictions.

Various OOD detection techniques

Ensemble Learning

In ensemble learning, several models each produce a prediction for every data point, and their decisions are combined to improve overall performance. There are several methods for combining decisions:

Averaging – for regression tasks, the predictions of all models are simply averaged; for classification tasks, the softmax probabilities can be averaged.

Weighted averaging – different weights are assigned to the models, and the final prediction is computed as a weighted average of their outputs.

Maximum voting – the final prediction follows the majority of the models’ predictions.

Combining the decisions, in this case, amounts to computing the prediction confidence across the different models.
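The averaging strategy above can be sketched as follows. This is a minimal illustration with made-up logits for three hypothetical models; the function names are ours, not from any particular library.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_confidence(logits_per_model):
    """Average the softmax probabilities of all ensemble members and
    return the averaged distribution plus its max-probability confidence."""
    probs = np.stack([softmax(l) for l in logits_per_model])  # (models, samples, classes)
    avg = probs.mean(axis=0)                                  # (samples, classes)
    return avg, avg.max(axis=-1)

# Three hypothetical models scoring two samples over three classes:
# the first sample gets confident, agreeing logits; the second gets flat ones.
logits = [np.array([[2.0, 0.5, 0.1], [0.2, 0.1, 0.0]]),
          np.array([[1.8, 0.4, 0.3], [0.1, 0.3, 0.2]]),
          np.array([[2.2, 0.2, 0.2], [0.0, 0.2, 0.1]])]
avg_probs, confidence = ensemble_confidence(logits)
```

A low averaged confidence (as for the second sample here) signals disagreement or uncertainty across the ensemble, which is the cue used for OOD detection.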

Using Binary Classification model

The trained model is evaluated on a held-out dataset, and correctly answered examples are labeled as positive while incorrectly answered examples are labeled as negative (note that this step is independent of the examples’ actual labels). A binary classification model can then be trained on this annotated dataset to predict whether incoming samples fall into the positive or negative class.

Though this methodology is better suited to the problem of success and error prediction, it can easily be adapted to out-of-distribution detection by including OOD instances when training the calibrator.
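The steps above can be sketched with a tiny NumPy logistic-regression calibrator. The data here is synthetic and the correct/incorrect labels are stand-ins; in practice they would come from scoring the base model on a real held-out set.

```python
import numpy as np

# Hypothetical held-out set: feature vectors plus a 0/1 label saying
# whether the base model answered each example correctly (synthetic stand-in).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(float)  # stand-in "answered correctly" labels

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Minimal logistic-regression calibrator trained by gradient descent.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * (p - y).mean()

# For new samples, the predicted probability of the positive ("correct")
# class acts as a confidence score; low scores can be treated as OOD.
confidence = sigmoid(rng.normal(size=(3, 5)) @ w + b)
```

In practice any binary classifier can play the calibrator role; logistic regression is used here only to keep the sketch self-contained.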

Calibration is the problem of producing probability estimates that reflect the true likelihood of correctness. For this, a model should output both a prediction and a confidence measure.


For classification tasks, a neural network outputs a vector known as the logits. The logit vector is passed through a softmax function to obtain class probabilities, and the highest softmax probability is used as the prediction confidence.

  • This is one of the most basic yet effective out-of-distribution detection methods.
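This maximum-softmax-probability (MaxProb) rule can be sketched in a few lines. The threshold value here is an illustrative assumption; in practice it would be tuned on validation data.

```python
import numpy as np

def maxprob_scores(logits):
    """MaxProb: confidence = highest softmax probability per sample."""
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)

def flag_ood(logits, threshold=0.5):
    """Boolean mask: True where confidence falls below the threshold,
    i.e. the sample is treated as out-of-distribution."""
    return maxprob_scores(logits) < threshold

logits = np.array([[4.0, 0.1, 0.2],    # peaked logits -> confident, in-distribution
                   [0.3, 0.2, 0.25]])  # flat logits -> low confidence, likely OOD
mask = flag_ood(logits, threshold=0.5)  # -> [False, True]
```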

Temperature scaling

For MaxProb, the prediction confidence is computed with the softmax function. Temperature scaling is a variant of Platt scaling that uses a single scalar parameter T > 0: the logits z are divided by T before the softmax, so the confidence becomes q = max_j softmax(z / T)_j.

With T > 1, the softmax is “softened”: the network becomes significantly less confident, and the confidence scores move closer to genuine probabilities.

As T increases, the probabilities approach 1/J (where J is the number of classes), which represents maximum uncertainty.

The original softmax probabilities are recovered when T = 1.

The parameter T is learned on the validation set by minimizing the negative log-likelihood.

Temperature scaling has no effect on the model’s accuracy, because dividing by T does not change which class attains the maximum softmax probability (all class probabilities are scaled equally, so the class prediction is unchanged).
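These properties can be checked directly with a minimal sketch (example logits and the temperature value are arbitrary illustrations):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temperature_scale(logits, T):
    """Temperature scaling: divide the logits by T before the softmax.
    T > 1 softens the distribution; T = 1 recovers the original softmax."""
    return softmax(logits / T)

logits = np.array([[3.0, 1.0, 0.5]])
p1 = temperature_scale(logits, T=1.0)  # original softmax
p5 = temperature_scale(logits, T=5.0)  # softened: lower max confidence,
                                       # but the argmax (class prediction)
                                       # is unchanged
```

In a full pipeline, T would not be hard-coded: it is fit on the validation set by minimizing the negative log-likelihood, with the network's weights frozen.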