Exploring Simple, High Quality Out-of-Distribution Detection with L2 Normalization

Jarrod Haas; William Yolland; Bernhard Rabus

TL;DR

This work tackles unreliable confidence in deep classifiers for out-of-distribution inputs by proposing a remarkably simple baseline: apply L2 normalization to encoder features during training. The method decouples feature magnitude from direction, allowing norms to carry rich information about input familiarity without additional losses or tuning, and can be implemented with two lines of PyTorch code. Empirically, it yields competitive OoD detection on several architectures and ID datasets, often with faster training than state-of-the-art methods. The authors connect this behavior to Neural Collapse theory and coherent gradient dynamics, offering a theoretical lens for why feature norms encode useful image information and signaling a promising direction for efficient OoD detection research.
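The two-line change described above might look like the following minimal sketch. It is not the authors' snippet: the class name `L2NormClassifier`, the toy linear encoder, and all layer sizes are assumptions chosen only to keep the example self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class L2NormClassifier(nn.Module):
    """Toy encoder plus linear head (hypothetical names and sizes)."""

    def __init__(self, in_dim: int = 32, feat_dim: int = 16, num_classes: int = 10):
        super().__init__()
        # Stand-in for a real encoder such as a ResNet backbone.
        self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        z = self.encoder(x)
        norms = z.norm(p=2, dim=1)      # added line 1: keep pre-normalization norms
        z = F.normalize(z, p=2, dim=1)  # added line 2: project features onto the unit sphere
        return self.fc(z), norms


torch.manual_seed(0)
model = L2NormClassifier()
x = torch.randn(4, 32)          # a batch of 4 random inputs
logits, norms = model(x)        # norms serve as the OoD score (smaller = more likely OoD)
```

Returning the norms alongside the logits keeps the training loop unchanged: the loss uses only `logits`, and `norms` is read at test time.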

Abstract

We demonstrate that L2 normalization over feature space can produce capable performance for Out-of-Distribution (OoD) detection for some models and datasets. Although it does not demonstrate outright state-of-the-art performance, this method is notable for its extreme simplicity: it requires only two additional lines of code, and does not need specialized loss functions, image augmentations, outlier exposure or extra parameter tuning. We also observe that training may be more efficient for some datasets and architectures. Notably, only 60 epochs with ResNet18 on CIFAR10 (or 100 epochs with ResNet50) can produce performance within two percentage points (AUROC) of several state-of-the-art methods for some near and far OoD datasets. We provide theoretical and empirical support for this method, and demonstrate viability across five architectures and three In-Distribution (ID) datasets.
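To make the AUROC metric concrete, here is a small standard-library sketch that scores detection using feature norms, with larger norms treated as more likely ID. The function name `auroc` and the norm values are made up for illustration and are not results from the paper.

```python
def auroc(id_scores, ood_scores):
    """Probability that a random ID sample scores higher than a random
    OoD sample, counting ties as half a win (pairwise AUROC definition)."""
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


# Hypothetical pre-normalization feature norms (illustrative values only).
id_norms = [9.1, 7.4, 8.8, 3.0]
ood_norms = [2.1, 3.5, 1.9, 2.8]

print(auroc(id_norms, ood_norms))  # 15 of 16 pairs ranked correctly -> 0.9375
```

The pairwise form is quadratic in the number of samples. It is used here for readability; a rank-based implementation would be preferable for full test sets.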

Paper Structure

This paper contains 16 sections, 5 equations, 7 figures, and 8 tables.

Introduction
Background
Problem Setup
Related Work
Methodology
Cross Entropy Loss and Neural Collapse
L2 Normalization: Decoupling Feature Vectors from Equinormality
Experiments
Equinormality Under Cross Entropy Loss
Norm Growth During Training
Image Information Encoded in Features
Conclusion
Appendix
Training Details
Measuring Decoupled Feature Norms
...and 1 more sections

Figures (7)

Figure 1: A PyTorch code snippet illustrating the proposed method, which only requires the addition of two lines of code to a standard forward function. When training is complete, pre-normalized feature norms can be sampled along with model predictions in the usual manner. These norms are the OoD score: smaller norms are more likely to be OoD.
Figure 2: Progression of Neural Collapse during training (left to right). Under Cross Entropy loss, features converge to equinormal and equiangular vectors. Small blue spheres represent extracted features (classes are different shades of blue), blue ball-and-sticks are class-means, red ball-and-sticks are linear classifiers. The simplex ETF pictured is on the 2D plane in 3D space, such that each arm is equidistant at 120 degrees. Image from papyan2020prevalence.
Figure 3: (Left) Because the loss gradient is orthogonal to the features when L2 normalization is used, any weight update that changes the features pushes them along their tangent line to the hypersphere. This means each backward pass makes features slightly larger, and this trend continues indefinitely in the absence of weight decay (Image adapted from DBLP:journals/corr/normface). (Center) We separate converged features of the CIFAR10 test set into four groups, based on feature norms. As predicted, features that had the highest magnitudes at the end of training change the most during training. (Right) Without L2 normalization, the variability of norms decreases during training to minimize CE loss (the equinormality condition of NC); with L2 normalization, it increases during training.
Figure 4: Plots of feature norms vs softmax scores for all CIFAR10 (blue) and SVHN (orange) test images. (Left) Allowing feature magnitudes to grow saturates the softmax function, and decreases the headroom available for separability. This occurs in both L2 and NoL2 models, although saturation is greater in L2 models. (Center) Scaled down features from the NoL2 ResNet18 350 model. (Right) Scaled down features from the L2 ResNet18 60 model. The nearly linear correlation of feature norm to softmax, in addition to the more isolated cluster of OoD images, is evidence that more image-level information is retained on a per-image basis than in NoL2 models.
Figure 5: Comparison of norms under the NoL2 ResNet18 350 (left) and L2 ResNet18 60 (right) models on four datasets (CIFAR10 Test, SVHN, Gaussian noise and pixel-wise scrambled CIFAR10 Test). We hypothesize that models without L2 normalization generate norms that are invariant to inputs in order to meet the equinormality condition of CE loss. Our results support this claim: all datasets have roughly equally sized norms on average, with very little variation amongst them. When L2 normalization is used during training, ID norms (measured prior to normalization) grow large and have high variability, while OoD norms are much smaller. The sensitivity of convolutions to features is not being conditioned/suppressed to produce equinormality in the latter case, which results in richer feature-level information and better OoD detection.
...and 2 more figures
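The norm-growth argument in the Figure 3 caption can be sketched with a short derivation. The notation is an assumption made for this sketch and is not taken from the paper: z is the pre-normalized feature, \hat{z} = z / \|z\|_2 is its normalized version, \mathcal{L} is a loss that depends on z only through \hat{z}, and \eta is a step size. The step is applied directly to the feature, which idealizes the actual weight updates.

```latex
\begin{align}
\frac{\partial \hat{z}}{\partial z}
  &= \frac{1}{\|z\|_2}\left(I - \hat{z}\hat{z}^{\top}\right), \\
\nabla_{z}\mathcal{L}
  &= \frac{1}{\|z\|_2}\left(I - \hat{z}\hat{z}^{\top}\right)\nabla_{\hat{z}}\mathcal{L}, \\
z^{\top}\nabla_{z}\mathcal{L}
  &= \hat{z}^{\top}\left(I - \hat{z}\hat{z}^{\top}\right)\nabla_{\hat{z}}\mathcal{L}
   = \left(\hat{z}^{\top} - \hat{z}^{\top}\right)\nabla_{\hat{z}}\mathcal{L} = 0, \\
\|z - \eta\,\nabla_{z}\mathcal{L}\|_2^{2}
  &= \|z\|_2^{2} + \eta^{2}\,\|\nabla_{z}\mathcal{L}\|_2^{2} \;\ge\; \|z\|_2^{2}.
\end{align}
```

The third line uses \hat{z}^{\top}\hat{z} = 1. The last line is the Pythagorean identity for the orthogonal vectors z and \nabla_{z}\mathcal{L}, so under this idealization a step can never shrink the norm, which matches the growth described in the caption when weight decay is absent.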
