Logit Scaling for Out-of-Distribution Detection

Andrija Djurisic; Rosanne Liu; Mladen Nikolic

TL;DR

This work tackles OOD detection in open-world settings by introducing Logit Scaling (LTS), a post-hoc method that requires no access to training-data statistics and preserves the original network. LTS derives a per-sample scaling factor from penultimate-layer activations and applies it to logits, with final OOD scoring based on the energy function $E(\mathbf{x}; f) = -\log \sum_{i=1}^C e^{S(\mathbf{x}) f_i(\mathbf{x})}$, where $S(\mathbf{x})$ is computed from the top $p\%$ activations. Across CIFAR-10/100, ImageNet, and OpenOOD benchmarks on 9 architectures, LTS achieves state-of-the-art OOD detection metrics (AUROC, FPR@95, AUPR), while preserving in-distribution accuracy and maintaining cross-architecture robustness. The method's simplicity, efficiency, and broad applicability make it a practically impactful tool for reliable model deployment in diverse settings. Future work will address performance gaps on Far-OOD and non-convolutional architectures, and refine integration with complementary techniques like ReAct.
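The scoring step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the summary does not give the exact formula for $S(\mathbf{x})$, so the sketch assumes, purely for illustration, that it is the ratio of the sum of all penultimate-layer activations to the sum of the top $p\%$ activations. The function names `lts_scale` and `lts_energy` and the small `eps` constant are hypothetical, and the activations are assumed to be non-negative (for example, post-ReLU).

```python
import numpy as np


def lts_scale(features: np.ndarray, p: float = 5.0, eps: float = 1e-12) -> np.ndarray:
    """Per-sample scaling factor from penultimate-layer activations.

    ASSUMPTION (not stated in the summary): S(x) is the sum of all
    activations divided by the sum of the top p% activations.
    `features` has shape (n_samples, n_units) and is assumed non-negative.
    """
    d = features.shape[1]
    k = max(1, int(round(d * p / 100.0)))  # number of units in the top p%
    # np.partition moves the k largest values of each row into the last k columns.
    top_k = np.partition(features, d - k, axis=1)[:, d - k:]
    return features.sum(axis=1) / (top_k.sum(axis=1) + eps)


def lts_energy(logits: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """E(x; f) = -log sum_i exp(S(x) * f_i(x)), computed stably per sample."""
    z = scale[:, None] * logits                    # scale each sample's logits
    m = z.max(axis=1, keepdims=True)               # subtract the row max for stability
    log_sum_exp = m[:, 0] + np.log(np.exp(z - m).sum(axis=1))
    return -log_sum_exp


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = np.maximum(rng.normal(size=(4, 2048)), 0.0)  # stand-in for post-ReLU features
    logits = rng.normal(size=(4, 1000))                  # stand-in for classifier logits
    print(lts_energy(logits, lts_scale(feats, p=5.0)))
```

Under this assumed form, only the logits passed to the scoring function are rescaled; the network weights and activations are left untouched, which matches the post-hoc framing above.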

Abstract

The safe deployment of machine learning and AI models in open-world settings hinges critically on the ability to accurately detect out-of-distribution (OOD) data, that is, samples that differ substantially from the data the model was trained on. Current approaches to OOD detection often require further training of the model and/or statistics about the training data, which may no longer be accessible. Additionally, many existing OOD detection methods struggle to maintain performance when transferred across different architectures. Our research tackles these issues by proposing a simple, post-hoc method that does not require access to the training data distribution, keeps a trained network intact, and maintains strong performance across a variety of architectures. Our method, Logit Scaling (LTS), as the name suggests, simply scales the logits in a manner that effectively distinguishes between in-distribution (ID) and OOD samples. We tested our method on benchmarks across various scales, including CIFAR-10, CIFAR-100, ImageNet and OpenOOD. The experiments cover 3 ID and 14 OOD datasets, as well as 9 model architectures. Overall, we demonstrate state-of-the-art performance, robustness and adaptability across different architectures, paving the way towards a universally applicable solution for advanced OOD detection.

Paper Structure (18 sections, 2 equations, 13 figures, 6 tables, 1 algorithm)

Introduction
Related work
Logit Scaling for OOD Detection
Experiments
OOD detection benchmark
OOD evaluation metrics
OOD detection performance
LTS applicability across architectures
Failure cases and limitations
Ablation studies
Conclusion
Detailed CIFAR-10 And CIFAR-100 Results
LTS Performance Across Eight Architectures
Application of LTS to different architectures
Differences between LTS and SCALE
...and 3 more sections

Figures (13)

Figure 1: Overview of the LTS method for OOD detection. LTS works at inference time, during the forward pass. It takes the feature representations and computes a sample-specific scalar value, which is then used to scale the logits. The final OOD detection score is calculated by applying a scoring function to the scaled logits. LTS incurs minimal computational cost and does not modify activations in any way, so it fully preserves the original behavior of the network while significantly enhancing OOD detection.
Figure 2: Effect of LTS Treatment. The plots show how LTS changes the distribution of logits and Energy scores. The left-hand plots show the state before LTS is applied, while the right-hand plots (zoomed in) show the state after. The plots were generated using a ResNet-50 architecture pretrained on the ImageNet-1k (ID) dataset, with iNaturalist serving as the OOD dataset. Applying LTS produces a more extreme logit distribution for OOD samples and improves the separation between ID and OOD scores, leading to a substantial enhancement in OOD detection performance. The logit distribution plots were generated using a single ID and a single OOD sample, whereas the energy score plots were created using 200 images sampled from both the ID and OOD datasets.
Figure 3: Activation values and examples of ID and OOD samples. On the left we plot, for ID (ImageNet-1k) and OOD (iNaturalist) samples, the activation values of all 2048 units in the penultimate layer of a ResNet-50 pretrained on ImageNet-1k. The plotted values are averages over 100 samples taken from each dataset. The figure is a replication of Figure 1(b) of the ReAct paper. On the right we show example pictures from the corresponding dataset, including class prediction and confidence.
Figure 4: Performance comparison of OOD methods across five architectures. The evaluation is based on two metrics: AUROC (top figure) and FPR@95 (bottom figure). All results are on the ImageNet-1k benchmark and are averaged across 4 tasks (iNaturalist, SUN, Places, and Textures). Higher AUROC indicates better performance, while lower FPR@95 indicates better performance.
Figure 5: Analysis of optimal $p$ values for scaling factor calculation. The top row shows the performance of LTS evaluated across five architectures using various values of the hyperparameter $p$, with results averaged over four tasks. The bottom row focuses on a single architecture, namely DenseNet, pretrained on three different in-distribution datasets: CIFAR-10, CIFAR-100, and ImageNet-1k. Across all our experiments, LTS consistently achieves optimal performance at $p = 5\%$ on both evaluation metrics, AUROC and FPR@95. We therefore recommend using $p = 5\%$ as the default setting.
...and 8 more figures
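
The two metrics named in the Figure 4 caption, AUROC and FPR@95, can be computed from per-sample detection scores as in the sketch below. It is a minimal illustration under stated assumptions rather than the paper's evaluation code: ID samples are treated as the positive class, a higher score is assumed to mean "more in-distribution" (so an energy score would be negated first), and the function names `auroc` and `fpr_at_tpr` are hypothetical.

```python
import numpy as np


def auroc(id_scores, ood_scores) -> float:
    """Probability that a random ID sample scores higher than a random OOD sample.

    Ties count as one half. Pairwise comparison is O(n * m), which is fine
    for a sketch but not for very large evaluation sets.
    """
    id_s = np.asarray(id_scores, dtype=float)[:, None]
    ood_s = np.asarray(ood_scores, dtype=float)[None, :]
    return float((id_s > ood_s).mean() + 0.5 * (id_s == ood_s).mean())


def fpr_at_tpr(id_scores, ood_scores, tpr: float = 0.95) -> float:
    """Fraction of OOD samples accepted as ID when at least `tpr` of ID samples are accepted."""
    id_sorted = np.sort(np.asarray(id_scores, dtype=float))
    # Threshold chosen so that at least `tpr` of the ID scores lie at or above it.
    idx = int(np.floor((1.0 - tpr) * len(id_sorted)))
    threshold = id_sorted[idx]
    return float((np.asarray(ood_scores, dtype=float) >= threshold).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    id_scores = rng.normal(loc=2.0, size=1000)   # synthetic scores, for illustration only
    ood_scores = rng.normal(loc=0.0, size=1000)
    print("AUROC:", auroc(id_scores, ood_scores))
    print("FPR@95:", fpr_at_tpr(id_scores, ood_scores))
```

A perfect detector gives AUROC 1.0 and FPR@95 of 0.0, which matches the caption's note that higher AUROC and lower FPR@95 are better.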
