How Out-of-Distribution Detection Learning Theory Enhances Transformer: Learnability and Reliability

Yijin Zhou; Yutang Ge; Xiaowen Dong; Yuguang Wang

TL;DR

The paper addresses OOD generalization challenges in transformer models by introducing a PAC-based framework for OOD detection and proving learnability conditions in the separate data space $\mathcal{D}^s_{XY}$ under transformer capacity constraints. It defines the transformer hypothesis space $\mathcal{H}_{Trans}$, derives Jackson-type approximation bounds that tie learnability to the budget and depth of the network, and shows that OOD detection is not learnable in the full open space. To bridge theory and practice, the authors propose GROD (Generate Rounded OOD Data), an approach that couples an ID-OOD binary loss with synthetic OOD generation via PCA/LDA projections and Mahalanobis filtering, and that achieves state-of-the-art performance on image and text tasks. They validate GROD on CV and NLP benchmarks, run extensive ablations, and discuss practical training considerations and the role of post-processing. Together, the work advances both the theoretical understanding of OOD detection in transformers and the practical toolkit for it, offering principled guidance for model design and training strategies.
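As a rough illustration of the "synthetic OOD generation plus Mahalanobis filtering" idea mentioned above, the following NumPy sketch pushes boundary points outward along the top PCA axis and keeps only candidates that lie farther from the ID data than any ID sample. It is a simplified sketch under stated assumptions, not the authors' GROD procedure: the function names, the parameters `a`, `n_boundary`, and the thresholding rule are all hypothetical, and LDA projections are omitted.

```python
import numpy as np

def mahalanobis(x, mu, prec):
    """Mahalanobis distance of each row of x to mean mu, given a precision matrix."""
    d = x - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", d, prec, d))

def synth_outliers(feats, a=0.5, n_boundary=8):
    """Hypothetical helper: generate candidate outliers from ID features.

    feats: (n, d) array of in-distribution features.
    a: how far to push boundary points, as a fraction of the projected span.
    n_boundary: number of extreme points taken at each end of the PCA axis.
    """
    mu = feats.mean(axis=0)
    centered = feats - mu
    # Top principal direction from the SVD of the centered features.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axis = vt[0]
    proj = centered @ axis
    order = np.argsort(proj)
    low, high = order[:n_boundary], order[-n_boundary:]
    span = proj.max() - proj.min()
    # Push the extreme points outward along the principal axis.
    cand = np.vstack([feats[low] - a * span * axis,
                      feats[high] + a * span * axis])
    # Assumed filter: keep candidates beyond the largest ID Mahalanobis distance.
    cov = np.cov(centered, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
    prec = np.linalg.inv(cov)
    threshold = mahalanobis(feats, mu, prec).max()
    return cand[mahalanobis(cand, mu, prec) > threshold]
```

With a small `a`, some candidates may fall inside the ID region and be discarded; a larger `a` keeps more of them but places them farther from the decision boundary.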

Abstract

Transformers excel in natural language processing and computer vision tasks. However, they still face challenges in generalizing to Out-of-Distribution (OOD) datasets, i.e., data whose distribution differs from that seen during training. OOD detection aims to distinguish outliers while preserving performance on in-distribution (ID) data. This paper introduces the OOD detection Probably Approximately Correct (PAC) Theory for transformers, which establishes the conditions on the data distribution and model configuration under which OOD detection is learnable by transformers. It shows that, under these conditions, outliers can be accurately represented and distinguished given sufficient data. The theoretical implications highlight the trade-off between theoretical principles and practical training paradigms. By examining this trade-off, we derive the rationale for leveraging auxiliary outliers to enhance OOD detection. Our theory suggests that by penalizing the misclassification of outliers within the loss function and strategically generating soft synthetic outliers, one can robustly bolster the reliability of transformer networks. This approach yields a novel algorithm that ensures learnability and refines the decision boundaries between inliers and outliers. In practice, the algorithm consistently achieves state-of-the-art (SOTA) performance across various data formats.
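To make "penalizing the misclassification of outliers within the loss function" concrete, here is a minimal NumPy sketch of a loss that mixes a $(K+1)$-class cross-entropy with a binary ID-versus-OOD term through a weight `gamma`. The convex-combination form, the function names, and the convention that the last logit column is the OOD class are assumptions made for illustration; the summary above does not state the exact formula.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable row-wise log-softmax."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def combined_loss(logits, labels, gamma=0.2):
    """Hypothetical mixed loss.

    logits: (n, K+1) array; the last column is assumed to be the OOD class.
    labels: (n,) integer array with values in [0, K]; K marks an outlier.
    gamma: weight of the binary ID-vs-OOD term (assumed convex combination).
    """
    k = logits.shape[1] - 1
    logp = log_softmax(logits)
    rows = np.arange(len(labels))
    # Standard cross-entropy over all K+1 classes.
    ce = -logp[rows, labels].mean()
    # Binary term: collapse the K ID classes into a single "ID" probability.
    p_ood = np.exp(logp[:, k])
    p_id = 1.0 - p_ood
    eps = 1e-12
    is_ood = labels == k
    bce = -np.where(is_ood, np.log(p_ood + eps), np.log(p_id + eps)).mean()
    return (1.0 - gamma) * ce + gamma * bce
```

Under this sketch, `gamma=0` reduces to plain cross-entropy, while larger values put more weight on separating ID from OOD regardless of which ID class is predicted.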

Paper Structure (58 sections, 11 theorems, 60 equations, 8 figures, 10 tables, 1 algorithm)

Introduction
Related Works
Notations and preliminaries
Notations.
The transformer hypothesis space.
Theoretical results
OOD detection in the separate space
Conditions for learning with transformers.
Extent of learnability by capacity of transformer network.
OOD detection in other a-priori-unknown spaces
Perspective of leveraging auxiliary outliers
Gap of theory and training
ID-OOD binary classification loss function.
Generate rounded outliers.
GROD algorithm
...and 43 more sections

Key Result

Theorem 1.2

If $l(y_2, y_1)\leq l(K+1, y_1)$ for any in-distribution labels $y_1$ and $y_2\in \mathcal{Y}$, and the hypothesis space $\mathcal{H}$ is FCNN-based or corresponding score-based, then OOD detection is learnable in the separate space $\mathcal{D}^s_{XY}$ for $\mathcal{H}$ if and only if $|\mathcal{X}

Figures (8)

Figure 1: The classification and OOD detection results regarding improvements. The first row of subfigures shows results under varying OOD distributions, with scatter plots below depicting training and test data. The trade-off in $\mathcal{L}$ for different $\gamma$ and the effectiveness of adding rounded OOD data are highlighted. Results are calculated over five random seeds.
Figure 2: Overview of the GROD algorithm. In the fine-tuning stage, GROD generates fake OOD data as part of the training data and guides training by incorporating the ID-OOD classifier in the loss. In the inference stage, the features and adjusted logits are input into the post-processor.
Figure 3: (a) Visualization of the generated two-dimensional Gaussian mixture dataset. (b) Curves show the classification accuracy and OOD detection accuracy in the training and test stages for different model capacities. The likelihood score bars show that the model with theoretical support is unable to learn OOD characteristics, which leads to the failure of OOD detection.
Figure 4: Quantitative comparison of the computational costs of various OOD detection methods on image datasets, with fine-tuning and post-processing times reported in subfigures (a) and (b), respectively. Methods that consist only of post-processing (MSP, ODIN, VIM, GEN, and ASH) are applied after "baseline" fine-tuning. The outlier exposure methods OE and MIXOE use MSP for post-processing.
Figure 5: Ablation study on extra hyper-parameters in GROD. (a) The weight $\gamma$ in $\mathcal{L}$. (b) The parameter $a$ adjusts the extending distance of generated OOD data. (c) The number of every OOD cluster $num$. The ID dataset is CIFAR-10 and the backbone is the pre-trained ViT-B-16.
...and 3 more figures

Theorems & Definitions (34)

Definition 1.1 (fang2022out): Strong learnability
Theorem 1.2 (fang2022out): Informal, learnability in FCNN-based and score-based hypothesis spaces
Definition 3.1: Budget of a transformer block
Definition 3.2: Transformer hypothesis space
Definition 3.3: Classifier
Definition 3.4: Transformer hypothesis space for OOD detection
Lemma 4.1
Theorem 4.2: Necessary and sufficient condition for OOD detection learnability on transformers
Definition 4.3: Probability of the OOD detection learnability
Theorem 4.4
...and 24 more
