When and How Does In-Distribution Label Help Out-of-Distribution Detection?

Xuefeng Du; Yiyou Sun; Yixuan Li

TL;DR

This work asks when and how in-distribution (ID) labels improve out-of-distribution (OOD) detection. It builds a graph-based framework in which ID data form a similarity graph, and it learns representations via spectral decomposition of that graph, which is shown to be equivalent to minimizing a contrastive objective. The authors derive a provable lower bound on the improvement in OOD detection accuracy when ID labels are used, expressed in terms of ID connectivity and ID–OOD coupling, and provide intuitive insights and a simplified bound for the near-OOD and far-OOD regimes. They validate the theory on both synthetic and real datasets (e.g., CIFAR-10/100), demonstrating that ID labels yield notable gains in near-OOD scenarios and under certain connectivity conditions, with results robust to changes in the OOD distribution between training and evaluation. The work advances theoretical understanding of the ID–OOD relationship and offers practical guidance for leveraging ID labels in OOD-sensitive applications.

Abstract

Detecting data points deviating from the training distribution is pivotal for ensuring reliable machine learning. Extensive research has been dedicated to the challenge, spanning classical anomaly detection techniques to contemporary out-of-distribution (OOD) detection approaches. While OOD detection commonly relies on supervised learning from a labeled in-distribution (ID) dataset, anomaly detection may treat the entire ID data as a single class and disregard ID labels. This fundamental distinction raises a significant question that has yet to be rigorously explored: when and how does ID label help OOD detection? This paper bridges this gap by offering a formal understanding to theoretically delineate the impact of ID labels on OOD detection. We employ a graph-theoretic approach, rigorously analyzing the separability of ID data from OOD data in a closed-form manner. Key to our approach is the characterization of data representations through spectral decomposition on the graph. Leveraging these representations, we establish a provable error bound that compares the OOD detection performance with and without ID labels, unveiling conditions for achieving enhanced OOD detection. Lastly, we present empirical results on both simulated and real datasets, validating theoretical guarantees and reinforcing our insights. Code is publicly available at https://github.com/deeplearning-wisc/id_label.
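The spectral step mentioned in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the degree normalization, the scaling of eigenvectors by the square root of their eigenvalues, the helper names `toy_adjacency` and `spectral_embedding`, and the block connectivity values are all assumptions chosen for illustration, following the standard spectral embedding of a similarity graph.

```python
import numpy as np


def toy_adjacency(sizes, within, across):
    """Hypothetical block-structured graph: nodes in the same block are
    connected with weight `within`, nodes in different blocks with `across`."""
    labels = np.repeat(np.arange(len(sizes)), sizes)
    same_block = labels[:, None] == labels[None, :]
    adjacency = np.where(same_block, within, across).astype(float)
    return adjacency, labels


def spectral_embedding(adjacency, k):
    """Top-k spectral embedding of a symmetric, non-negative adjacency matrix.

    Assumption: the graph is normalized as D^{-1/2} A D^{-1/2}, and each of the
    k leading eigenvectors is scaled by the square root of its eigenvalue, so
    that F @ F.T is the best rank-k approximation of the normalized matrix
    (when the leading eigenvalues are non-negative).
    """
    A = np.asarray(adjacency, dtype=float)
    degree = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(A_norm)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


if __name__ == "__main__":
    # Three groups of four nodes: dense within a group, sparse across groups.
    A, labels = toy_adjacency([4, 4, 4], within=1.0, across=0.1)
    F = spectral_embedding(A, k=3)
    print(F.shape)
    print(np.round(F, 3))
```

In this toy graph, nodes in the same densely connected group receive identical rows of `F`, which is the sense in which denser connectivity pulls representations together.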

Paper Structure (47 sections, 12 theorems, 92 equations, 3 figures, 9 tables)

Introduction
Problem Setup
Analysis Framework
Overview of rationale.
Graph Formulation
Learning Representations Based on Graph Spectral
A surrogate objective.
Interpretation.
Theoretical Results
Representation for ID and OOD Data
ID representations.
OOD representations.
Representation in the unlabeled case.
An illustrative example.
Evaluation Target
...and 32 more sections

Key Result

Lemma 1

We define each row $\mathbf{f}_\mathbf{x}^{\top}$ of $\mathbf{F}^{(l)}$ as a scaled version of learned feature representation $\mathbf{h}_\mathbf{w}$, with $\mathbf{f}_\mathbf{x} = \sqrt{\zeta_\mathbf{x}}\mathbf{h}_\mathbf{w}(\mathbf{x})$. Then minimizing the loss function $\mathcal{L}(\mathbf{F}^{(

Figures (3)

Figure 1: Intuitive example on the ID labels' impact on OOD detection. (a) In the near OOD scenario where the OOD data connects densely with the ID data, without ID labels, the neural network produces indistinguishable embeddings for the ID (Ragdoll and Sphynx class) and OOD data (Scottish Fold class). By harnessing the power of the ID labeling information, the model learns more distinguishable embeddings that help ID vs. OOD separation. (b) In the far OOD scenario (Dog class), ID labels can be less beneficial because the representations learned in an unsupervised manner can already be separable between ID vs OOD.
Figure 2: Example showcasing the contrast between adjacency matrices and representations w/ (l) and w/o (u) ID labels. (a) The ID adjacency matrix in the labeled case $\mathbf{A}^{(l)}$. (b) The ID adjacency matrix in the unlabeled case $\mathbf{A}^{(u)}$. Here darker color indicates denser connectivity. The contrast of the OOD-ID adjacency matrix $\tilde{\mathbf{A}}_{\rm OI}$ w/ and w/o ID labels in the near OOD and far OOD scenarios is shown in (c) and (d), where the adjacency matrix has a larger Frobenius norm in the near OOD scenario, i.e., $\|\tilde{\mathbf{A}}_{\rm OI}^{(u)}\|_F=60$, and a smaller norm in the far OOD scenario, i.e., $\|\tilde{\mathbf{A}}_{\rm OI}^{(u)}\|_F=24$. (e) Learned representations in the near OOD scenario, where the OOD representations overlap with the ID representations in the unlabeled case but become linearly separable from them in the labeled case. (f) Representations in the far OOD scenario. The ID and OOD representations are already separable in the unlabeled case, so the benefit of ID labels is marginal.
Figure 3: Additional example showcasing the contrast between adjacency matrices and representations w/ (l) and w/o (u) ID labels. (a) The ID adjacency matrix in the unlabeled case $\mathbf{A}^{(u)}$ with a larger $\|\tilde{\mathbf{A}}^{(u)}\|_F$ ($B_1= 0.8, B_2=0.1, B_3=0.75, B_4=0.7$). (b) The contrast of the learned representations in both labeled and unlabeled cases when $\|\tilde{\mathbf{A}}^{(u)}\|_F=1.20$. (c) The ID adjacency matrix in the unlabeled case $\mathbf{A}^{(u)}$ with a smaller $\|\tilde{\mathbf{A}}^{(u)}\|_F$ ($B_1= 0.7, B_2=0.1, B_3=0.65, B_4=0.6$). (d) The contrast of the learned representations in both labeled and unlabeled cases when $\|\tilde{\mathbf{A}}^{(u)}\|_F=1.16$. Compared with (b) where the difference in the linear probing loss $\mathcal{G}$ is 0.09, the linear probing loss reduces from 0.16 to 0.02. (e) The ID adjacency matrix in the unlabeled case $\mathbf{A}^{(u)}$ with a smaller $\|\mathfrak{q}\|_F$ ($B_1= 0.7, B_2=0.1, B_3=0.65, B_4=0.6$). (f) The contrast of the learned representations in both labeled and unlabeled cases when $\|\mathfrak{q}\|_F=7.30$. (g) The ID adjacency matrix in the unlabeled case $\mathbf{A}^{(u)}$ with a larger $\|\mathfrak{q}\|_F$ ($B_1= 0.8, B_2=0.2, B_3=0.75, B_4=0.7$). (h) The contrast of the learned representations in both labeled and unlabeled cases when $\|\mathfrak{q}\|_F=8.79$. Compared with (f) where the difference in the linear probing loss $\mathcal{G}$ is 0.14, the linear probing loss reduces from 0.16 to 0.00. The visualization aligns with our theoretical reasoning as shown in Section 4.3.
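The two quantities that the captions of Figures 2 and 3 compare, the Frobenius norm of an OOD-ID adjacency block and a linear probing loss on learned representations, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact definitions: the helper names `ood_id_coupling` and `linear_probe_mse` are hypothetical, the block is taken from a raw adjacency matrix (the paper's $\tilde{\mathbf{A}}_{\rm OI}$ may be normalized differently), and the probe is assumed to be a least-squares linear regressor on a binary ID-vs-OOD target.

```python
import numpy as np


def ood_id_coupling(adjacency, is_ood):
    """Frobenius norm of the OOD-by-ID block of an adjacency matrix.

    Assumption: rows/columns of `adjacency` are indexed by samples and
    `is_ood` is a boolean mask marking which samples are OOD.
    """
    is_ood = np.asarray(is_ood, dtype=bool)
    block = np.asarray(adjacency, dtype=float)[np.ix_(is_ood, ~is_ood)]
    return float(np.linalg.norm(block, ord="fro"))


def linear_probe_mse(features, is_ood):
    """Mean squared error of the best linear (affine) probe that predicts
    the ID-vs-OOD indicator from the given representations."""
    features = np.asarray(features, dtype=float)
    X = np.column_stack([features, np.ones(len(features))])  # add a bias term
    y = np.asarray(is_ood, dtype=float)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((X @ w - y) ** 2))


if __name__ == "__main__":
    is_ood = np.array([False, False, True, True])
    adjacency = np.full((4, 4), 0.5)
    print(ood_id_coupling(adjacency, is_ood))

    separable = np.array([[0.0], [0.0], [1.0], [1.0]])
    collapsed = np.zeros((4, 1))
    print(linear_probe_mse(separable, is_ood))
    print(linear_probe_mse(collapsed, is_ood))
```

A lower probe error on one set of representations than on another mirrors the labeled-versus-unlabeled comparison the captions describe: representations in which ID and OOD points coincide cannot be separated by any linear probe, while well-separated ones drive the error toward zero.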

Theorems & Definitions (27)

Definition 1: Out-of-Distribution Detection w/ ID Labels
Definition 2: Out-of-Distribution Detection w/o ID Labels
Definition 3: Unlabeled case (u)
Definition 4: Labeled case (l)
Definition 5: Adjacency matrix for unlabeled ID data
Definition 6: Adjacency matrix for labeled ID data
Lemma 1: Theoretical equivalence between two objectives
Remark 1
Lemma 2
Theorem 1: Lower bound of the linear probing error difference w/ and w/o ID labels
...and 17 more
