How to train your ViT for OOD Detection

Maximilian Mueller; Matthias Hein

TL;DR

This paper studies how pretraining and finetuning schemes shape Vision Transformer (ViT) performance on out-of-distribution (OOD) detection. It conducts a large-scale empirical study across ViT variants from two public model pools, evaluating post-hoc detectors such as MaxLogit, Mahalanobis, and Relative Mahalanobis on NINCO and on unit-test OOD benchmarks, with performance reported as $FPR$ and $AUC$. The key finding is that ImageNet-21k pretraining combined with careful finetuning (notably large weight decay during pretraining and a small learning rate during finetuning) yields robust Mahalanobis-based detectors, but their effectiveness is highly sensitive to hyperparameters and to the type of OOD data. CLIP pretraining does not generally improve feature-based detectors, and finetuning remains essential for strong OOD performance. Together, these findings lead to a practical best-practice recipe for ViT-based OOD detection in real-world settings.
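To make the three post-hoc detectors named above concrete, here is a minimal NumPy sketch of how such scores are commonly computed from a classifier's logits and penultimate-layer features. It follows the standard formulations (class-conditional Gaussians with a shared covariance for Mahalanobis, and subtraction of a class-agnostic background distance for Relative Mahalanobis). The function names, the use of a pseudo-inverse, and the convention that higher scores mean "more in-distribution" are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def max_logit_score(logits):
    """MaxLogit: the largest logit per sample (higher = more in-distribution)."""
    return logits.max(axis=1)


def fit_gaussians(feats, labels, num_classes):
    """Fit class means with a shared covariance, plus one global Gaussian.

    Assumes every class in range(num_classes) has at least one training sample.
    """
    means = np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])
    centered = feats - means[labels]
    shared_cov = centered.T @ centered / len(feats)
    global_mean = feats.mean(axis=0)
    g_centered = feats - global_mean
    global_cov = g_centered.T @ g_centered / len(feats)
    # Pseudo-inverse is an illustrative choice to stay robust to singular covariances.
    return means, np.linalg.pinv(shared_cov), global_mean, np.linalg.pinv(global_cov)


def _sq_mahalanobis(x, mean, precision):
    """Squared Mahalanobis distance of each row of x to a single mean."""
    d = x - mean
    return np.einsum("nd,de,ne->n", d, precision, d)


def mahalanobis_score(x, means, precision):
    """Negative distance to the closest class mean."""
    dists = np.stack([_sq_mahalanobis(x, m, precision) for m in means], axis=1)
    return -dists.min(axis=1)


def relative_mahalanobis_score(x, means, precision, global_mean, global_precision):
    """Class distance minus the class-agnostic background distance, negated."""
    dists = np.stack([_sq_mahalanobis(x, m, precision) for m in means], axis=1)
    background = _sq_mahalanobis(x, global_mean, global_precision)
    return -(dists - background[:, None]).min(axis=1)
```

In this sketch the Gaussians would be fitted once on in-distribution training features, after which all three scores can be computed for any test batch without retraining, which is what makes the detectors post-hoc.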

Abstract

Vision Transformers (ViTs) have been shown to be powerful out-of-distribution (OOD) detectors for ImageNet-scale settings when finetuned from publicly available checkpoints, often outperforming other model types on popular benchmarks. In this work, we investigate the impact of both the pretraining and the finetuning scheme on the performance of ViTs on this task by analyzing a large pool of models. We find that the exact type of pretraining has a strong impact on which method works well and on OOD detection performance in general. We further show that certain training schemes might only be effective for a specific type of out-distribution, but not in general, and identify a best-practice training recipe.
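OOD detection performance of the kind discussed here is typically quantified with the two metrics named in the TL;DR, $FPR$ and $AUC$. The sketch below shows one common way to compute them from detector scores. It assumes that higher scores indicate in-distribution samples and that the false positive rate is measured at a 95% true positive rate; both conventions are assumptions, since the paper's own metric definitions are not part of this excerpt.

```python
import numpy as np


def auroc(scores_in, scores_out):
    """Probability that a random in-distribution sample outscores a random OOD sample.

    Ties count as one half. Uses an O(n * m) pairwise comparison, which is
    adequate for small illustrative inputs.
    """
    s_in = np.asarray(scores_in, dtype=float)[:, None]
    s_out = np.asarray(scores_out, dtype=float)[None, :]
    return float((s_in > s_out).mean() + 0.5 * (s_in == s_out).mean())


def fpr_at_tpr(scores_in, scores_out, tpr=0.95):
    """Fraction of OOD samples accepted at a threshold keeping roughly `tpr` of ID samples."""
    threshold = np.quantile(np.asarray(scores_in, dtype=float), 1.0 - tpr)
    return float((np.asarray(scores_out, dtype=float) >= threshold).mean())
```

A perfect detector gives an AUC of 1 and an FPR of 0, while a detector whose in- and out-distribution scores are indistinguishable gives an AUC of 0.5.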

Paper Structure (13 sections, 14 equations, 7 figures, 4 tables)

Introduction
Experimental Setup
Observations
Conclusions
Supplementary Material
Experimental details
Additional Results
Pretraining plots with AUC
Detailed results for ViT-B/16 with ImageNet-21k pretraining
CLIP Pretraining
Evaluating more methods
Methods
Definitions of OOD detection metrics

Figures (7)

Figure 1: Pretraining matters: ImageNet-21k pretraining paired with Mahalanobis-based detection methods strongly outperforms other detectors on NINCO, yet fails in many cases on the unit-test task.
Figure 2: ViT-B/16 trained exclusively on ImageNet-1k
Figure 4: Pretraining matters: ImageNet-21k pretraining paired with Mahalanobis-based detection methods strongly outperforms other detectors on NINCO, yet fails in many cases on the unit-test task.
Figure 5: CLIP models. Souped models are shown in orange.
Figure 6: ImageNet-1k models pretrained on ImageNet-21k with more methods (logit-based on the top, feature-based on the bottom).
...and 2 more figures
