Double Descent Meets Out-of-Distribution Detection: Theoretical Insights and Empirical Analysis on the role of model complexity

Mouïn Ben Ammar, David Brellmann, Arturo Mendoza, Antoine Manzanera, Gianni Franchi

TL;DR

The work shows that double descent, previously studied for ID generalization, also appears in post-hoc OOD detection across CNNs and transformers. It combines theory via random matrix theory in a Gaussian covariate model with extensive empirical validation on diverse architectures and OOD datasets, revealing that the OOD risk peaks near the interpolation threshold and that overparameterization is not universally beneficial. A Neural Collapse–based NC1 criterion is proposed to identify regimes where smaller models may outperform larger ones for OOD detection. The findings highlight the critical role of latent representation geometry in OOD detection and offer practical guidance for selecting model complexity in open-world settings.
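The NC1 quantity mentioned above can be illustrated with the within-class variability measure from the Neural Collapse literature, $\mathrm{Tr}(\Sigma_W \Sigma_B^{\dagger}) / C$. The sketch below assumes that definition; the exact normalization and the way the paper turns NC1 into a model-selection criterion are not given in this summary, so the function name `nc1` and the `rcond` cutoff are illustrative choices only.

```python
import numpy as np

def nc1(features, labels):
    """Within-class variability collapse: trace(Sigma_W @ pinv(Sigma_B)) / C.

    Assumed definition (standard in the Neural Collapse literature);
    smaller values mean features cluster more tightly around class means.
    """
    classes = np.unique(labels)
    n, d = features.shape
    global_mean = features.mean(axis=0)
    sigma_w = np.zeros((d, d))  # within-class covariance
    sigma_b = np.zeros((d, d))  # between-class covariance
    for c in classes:
        class_feats = features[labels == c]
        class_mean = class_feats.mean(axis=0)
        centered = class_feats - class_mean
        sigma_w += centered.T @ centered / n
        diff = (class_mean - global_mean)[:, None]
        sigma_b += diff @ diff.T / len(classes)
    # Sigma_B has rank at most C - 1, so use a pseudo-inverse with an
    # explicit cutoff to discard numerically zero singular values.
    sigma_b_pinv = np.linalg.pinv(sigma_b, rcond=1e-10)
    return float(np.trace(sigma_w @ sigma_b_pinv) / len(classes))
```

Under this assumed definition, one would compute `nc1` on the penultimate-layer features of each trained model and compare values across widths; how the comparison maps to a choice of regime is left to the paper.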

Abstract

Out-of-distribution (OOD) detection is essential for ensuring the reliability and safety of machine learning systems. In recent years, it has received increasing attention, particularly through post-hoc detection and training-based methods. In this paper, we focus on post-hoc OOD detection, which enables identifying OOD samples without altering the model's training procedure or objective. Our primary goal is to investigate the relationship between model capacity and OOD detection performance. Specifically, we aim to answer the following question: Does the Double Descent phenomenon manifest in post-hoc OOD detection? This question is crucial, as it can reveal whether overparameterization, which is already known to benefit generalization, can also enhance OOD detection. Despite growing interest in these topics within the classical supervised machine learning community, their intersection remains unexplored for OOD detection. We empirically demonstrate that the Double Descent effect does indeed appear in post-hoc OOD detection. Furthermore, we provide theoretical insights to explain why this phenomenon emerges in such a setting. Finally, we show that the overparameterized regime does not consistently yield superior results, and we propose a method to identify the optimal regime for OOD detection based on our observations.
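As a rough illustration of what "post-hoc" means here, the sketch below scores samples using only the logits of an already trained classifier and evaluates separation with AUROC. It uses the maximum softmax probability, a common confidence-based baseline; this is an assumption for illustration, since the summary does not state the paper's exact scoring function. The names `max_softmax_score` and `auroc` are hypothetical helpers.

```python
import numpy as np

def max_softmax_score(logits):
    """Confidence score: maximum softmax probability per sample (higher = more ID-like)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample (ties count 0.5)."""
    id_col = np.asarray(id_scores, dtype=float)[:, None]
    ood_row = np.asarray(ood_scores, dtype=float)[None, :]
    greater = (id_col > ood_row).mean()
    ties = (id_col == ood_row).mean()
    return float(greater + 0.5 * ties)

# Toy logits: confident predictions on ID inputs, near-uniform on OOD inputs.
id_logits = np.array([[5.0, 0.0, 0.0], [0.0, 6.0, 0.0]])
ood_logits = np.array([[0.1, 0.0, 0.0], [0.0, 0.0, 0.2]])
toy_auc = auroc(max_softmax_score(id_logits), max_softmax_score(ood_logits))
```

Because the score depends only on model outputs, the same evaluation can be repeated unchanged for models of different widths, which is what makes a width sweep of OOD performance possible.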

Paper Structure

This paper contains 69 sections, 5 theorems, 77 equations, 24 figures, 4 tables.

Key Result

Theorem 1

Let $(p, q) \in [d]^2$ such that $p + q = d$, $\mathcal{T} \subseteq [d]$ with $\lvert \mathcal{T} \rvert = p$ an arbitrary subset of the first $d$ natural integers, and $\mathcal{T}^c := [d] \setminus \mathcal{T}$ its complement set. Let $\hat{\bm{w}} \in R^{d}(\mathcal{T})$ such that [...] where $c, C > 0$ and [...]

Figures (24)

  • Figure 1: Illustration of the double descent phenomenon in a Random ReLU Feature model as a function of model width (log scale). (a) Evolution of the in-distribution (ID) Mean-Squared Error (MSE) and OOD detection risk. (b) Confidence-based OOD score defined in equation \ref{def:used_ood_score}. (c) Three-dimensional t-SNE projection of the input space, visualizing the separation between ID and OOD samples. The ID samples ($n=1{,}000$) are drawn from a Gaussian Mixture Model (GMM) fitted to a subset of CIFAR-10, while OOD samples are drawn from an independent GMM.
  • Figure 2: Accuracy and AUC OOD detection metric versus model width. Experiments performed on CNN, ResNet-18, ViT and Swin architectures with CIFAR10 and CIFAR100 as ID and OOD datasets.
  • Figure 3: Distribution of the values of the OOD scoring function $g(x)$ defined in equation \ref{def:used_ood_score}, evaluated on Random ReLU Feature (RRF) models with varying widths. (a) Score distributions for underparameterized models with widths, from left to right, of 100 and 500. (b) Score distribution at the interpolation threshold (width = 1,000), where performance degrades sharply. (c) Score distributions for overparameterized models with widths, from left to right, of 5,000 and 10,000.
  • Figure 4: OOD detection (AUC) metric versus model width. Experiments performed on CNN, ResNet-18, ViT, and Swin with CIFAR10 as ID dataset and ImageNet-O as OOD dataset.
  • Figure 5: OOD detection (AUC) metric versus model width. Experiments performed on CNN, ResNet-18, ViT and Swin with CIFAR10 as ID dataset and Texture as OOD dataset.
  • ...and 19 more figures
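
The width sweep behind Figure 1 can be approximated with a small random ReLU feature experiment. The sketch below is not the paper's setup: it assumes synthetic Gaussian inputs with a noisy linear target instead of a GMM fitted to CIFAR-10, uses a minimum-norm least-squares fit, and tracks only the ID test MSE. All names and default values (`random_relu_test_mse`, `n_train=100`, `noise=0.5`) are illustrative.

```python
import numpy as np

def random_relu_test_mse(width, n_train=100, n_test=500, dim=20, noise=0.5, seed=0):
    """Test MSE of a minimum-norm least-squares fit on random ReLU features."""
    rng = np.random.default_rng(seed)
    # Data is drawn before the projection, so it is identical across widths.
    beta = rng.normal(size=dim) / np.sqrt(dim)
    x_train = rng.normal(size=(n_train, dim))
    x_test = rng.normal(size=(n_test, dim))
    y_train = x_train @ beta + noise * rng.normal(size=n_train)
    y_test = x_test @ beta  # noiseless targets for evaluation
    # Fixed random first layer; only the linear readout is fitted.
    proj = rng.normal(size=(dim, width)) / np.sqrt(dim)
    phi_train = np.maximum(x_train @ proj, 0.0)
    phi_test = np.maximum(x_test @ proj, 0.0)
    # lstsq returns the minimum-norm solution when width > n_train.
    readout, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    return float(np.mean((phi_test @ readout - y_test) ** 2))

# Under-parameterized, interpolation threshold (width == n_train), over-parameterized.
widths = [20, 100, 1000]
mse_by_width = {w: random_relu_test_mse(w) for w in widths}
```

In this toy setting the error is expected to be largest near `width == n_train`, where the feature matrix is square and poorly conditioned, and to fall again for much larger widths. An OOD detection risk could be tracked across the same sweep by scoring held-out ID and OOD samples at each width.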

Theorems & Definitions (16)

  • Remark 4.1
  • Remark 4.2
  • Remark 4.3
  • Theorem 1
  • Remark 4.4
  • Remark 4.5
  • Theorem 2
  • Remark A.1
  • Remark A.2
  • proof
  • ...and 6 more