How to train your ViT for OOD Detection
Maximilian Mueller, Matthias Hein
TL;DR
This paper examines how pretraining and finetuning schemes shape Vision Transformer (ViT) performance on out-of-distribution (OOD) detection. It conducts a large-scale empirical study across ViT variants from two public pools, evaluating post-hoc detectors such as MaxLogit, Mahalanobis, and Relative Mahalanobis on NINCO and unit-test OOD benchmarks, with performance reported as FPR and AUC. The key finding is that ImageNet-21k pretraining combined with careful finetuning (notably large weight decay during pretraining and a small learning rate during finetuning) yields robust Mahalanobis-based detectors, but effectiveness is highly sensitive to hyperparameters and to the type of OOD data. CLIP pretraining does not generally improve feature-based detectors, and finetuning remains essential for strong OOD performance. Together, these results lead to a practical best-practice recipe for ViT-based OOD detection in real-world settings.
Abstract
Vision Transformers (ViTs) have been shown to be powerful out-of-distribution (OOD) detectors for ImageNet-scale settings when finetuned from publicly available checkpoints, often outperforming other model types on popular benchmarks. In this work, we investigate the impact of both the pretraining and finetuning scheme on the performance of ViTs on this task by analyzing a large pool of models. We find that the exact type of pretraining has a strong impact on which method works well and on OOD detection performance in general. We further show that certain training schemes might only be effective for a specific type of out-distribution, but not in general, and identify a best-practice training recipe.
