
How to train your ViT for OOD Detection

Maximilian Mueller, Matthias Hein

TL;DR

This paper tackles how pretraining and finetuning schemes shape Vision Transformer performance for out-of-distribution detection. It conducts a large-scale empirical study across ViT variants from two public pools, evaluating post-hoc detectors such as MaxLogit, Mahalanobis, and Relative Mahalanobis on NINCO and unit-test OOD benchmarks, with performance reported via $FPR$ and $AUC$. The key finding is that ImageNet-21k pretraining combined with careful finetuning (notably large weight decay during pretraining and a small learning rate during finetuning) yields robust Mahalanobis-based detectors, but effectiveness is highly sensitive to hyperparameters and the type of OOD data. CLIP pretraining does not generally improve feature-based detectors, and finetuning remains essential to realize strong OOD performance, leading to a practical best-practice recipe for ViT-based OOD detection in real-world settings.
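The post-hoc detectors and metrics named above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `logits` and `feats` are assumed to be precomputed model outputs and penultimate-layer features, all function names are hypothetical, the Mahalanobis variants follow the common formulation with class means and a shared covariance, and the FPR is assumed to be measured at a 95% true-positive rate, which the summary does not specify.

```python
import numpy as np

def max_logit_score(logits):
    """MaxLogit: the largest logit per sample (higher = more in-distribution)."""
    return logits.max(axis=1)

def fit_gaussians(feats, labels, num_classes):
    """Fit class means with a shared covariance, plus one class-agnostic Gaussian."""
    means = np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])
    centered = feats - means[labels]
    prec = np.linalg.pinv(centered.T @ centered / len(feats))
    g_mean = feats.mean(axis=0)
    g_centered = feats - g_mean
    g_prec = np.linalg.pinv(g_centered.T @ g_centered / len(feats))
    return means, prec, g_mean, g_prec

def _class_distances(feats, means, prec):
    """Squared Mahalanobis distance of every sample to every class mean: shape (n, C)."""
    diffs = feats[:, None, :] - means[None, :, :]
    return np.einsum("ncd,de,nce->nc", diffs, prec, diffs)

def mahalanobis_score(feats, means, prec):
    """Negative distance to the closest class mean (higher = more in-distribution)."""
    return -_class_distances(feats, means, prec).min(axis=1)

def relative_mahalanobis_score(feats, means, prec, g_mean, g_prec):
    """Class distance minus the distance under the class-agnostic Gaussian."""
    d2 = _class_distances(feats, means, prec)
    g = feats - g_mean
    d2_global = np.einsum("nd,de,ne->n", g, g_prec, g)
    return -(d2 - d2_global[:, None]).min(axis=1)

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample (ties count half)."""
    diff = id_scores[:, None] - ood_scores[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted when roughly `tpr` of ID samples are accepted."""
    threshold = np.quantile(id_scores, 1.0 - tpr)
    return float((ood_scores >= threshold).mean())
```

All scores use the convention that higher means more in-distribution, so the same `auroc` and `fpr_at_tpr` helpers apply to every detector. The pairwise comparison in `auroc` is quadratic in memory and only suitable for small evaluation sets.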

Abstract

Vision Transformers have been shown to be powerful out-of-distribution detectors for ImageNet-scale settings when finetuned from publicly available checkpoints, often outperforming other model types on popular benchmarks. In this work, we investigate the impact of both the pretraining and finetuning scheme on the performance of ViTs on this task by analyzing a large pool of models. We find that the exact type of pretraining has a strong impact on which method works well and on OOD detection performance in general. We further show that certain training schemes might only be effective for a specific type of out-distribution, but not in general, and identify a best-practice training recipe.

Paper Structure

This paper contains 13 sections, 14 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Pretraining matters: ImageNet-21k pretraining paired with Mahalanobis-based detection methods strongly outperforms other detectors on NINCO, yet fails in many cases on the unit-test task.
  • Figure 2: ViT-B/16 trained exclusively on ImageNet-1k
  • Figure 4: Pretraining matters: ImageNet-21k pretraining paired with Mahalanobis-based detection methods strongly outperforms other detectors on NINCO, yet fails in many cases on the unit-test task.
  • Figure 5: CLIP models. Souped models are shown in orange.
  • Figure 6: ImageNet-1k models pretrained on ImageNet-21k with more methods (logit-based on the top, feature-based on the bottom).
  • ...and 2 more figures