PViT: Prior-augmented Vision Transformer for Out-of-distribution Detection

Tianhao Zhang, Zhixiang Chen, Lyudmila S. Mihaylova

TL;DR

PViT introduces a Prior-augmented Vision Transformer that leverages priors from a pretrained model to improve OOD detection for Vision Transformers. By adding a prior token and formulating a Prior Guide Energy score, the model aligns predictions with priors on in-distribution data while increasing divergence for out-of-distribution samples. Extensive experiments on ImageNet-1K with seven OOD benchmarks show substantial gains in FPR95 and AUROC without synthetic outlier generation, and ablations validate the effectiveness of the prior-guidance term and prior token. This approach provides a scalable, data-efficient mechanism to imbue ViTs with a useful inductive bias for safety-critical vision tasks, and ports well to large vision models with minimal architectural changes.

Abstract

Vision Transformers (ViTs) have achieved remarkable success across various vision tasks, yet their robustness against data distribution shifts and their inherent inductive biases remain underexplored. To enhance the robustness of ViT models for image Out-of-Distribution (OOD) detection, we introduce a novel and generic framework named Prior-augmented Vision Transformer (PViT). Taking as input the prior class logits from a pretrained model, we train PViT to predict the class logits. During inference, PViT identifies OOD samples by quantifying the divergence between the predicted class logits and the prior logits obtained from pretrained models. Unlike existing state-of-the-art (SOTA) OOD detection methods, PViT shapes the decision boundary between in-distribution (ID) and OOD data using the proposed prior-guided confidence, without requiring additional data modeling, generation methods, or structural modifications. Extensive experiments on the large-scale ImageNet benchmark, evaluated against over seven OOD datasets, demonstrate that PViT significantly outperforms existing SOTA OOD detection methods in terms of FPR95 and AUROC. The codebase is publicly available at https://github.com/RanchoGoose/PViT.
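
To make the inference step concrete, the sketch below scores a sample by how far the predicted class logits diverge from the prior logits. It is only an illustration: the abstract does not give the scoring formula, so a KL divergence between the two softmax distributions is used here as a stand-in and should not be read as the paper's Prior Guide Energy score. The names `divergence_score` and `is_ood`, and the threshold, are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def divergence_score(prior_logits, predicted_logits):
    """Hypothetical stand-in score: larger means the prediction
    disagrees more with the prior (assumed to indicate OOD)."""
    return kl_divergence(softmax(prior_logits), softmax(predicted_logits))


def is_ood(prior_logits, predicted_logits, threshold):
    """Flag a sample as OOD when the divergence exceeds a chosen threshold."""
    return divergence_score(prior_logits, predicted_logits) > threshold


# Agreement between prior and prediction gives a score of zero;
# disagreement gives a positive score.
agree = divergence_score([2.0, 0.0, 0.0], [2.0, 0.0, 0.0])
disagree = divergence_score([5.0, 0.0, 0.0], [0.0, 0.0, 5.0])
```

In practice the threshold would be chosen on held-out ID data, for example so that 95% of ID samples are retained.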

Paper Structure

This paper contains 20 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: A brief overview of the proposed method. The ID images are taken from IMAGENET-1K, and the OOD images are sourced from OpenImage-O. ID and OOD images are distinguished by measuring the difference between the prior logits and the predicted logits.
  • Figure 2: Framework of our proposed PViT. During the training stage, PViT processes the ID image patches $\mathcal{D}_{\text{in}}^{\text{train}}$ alongside the prior token $\mathbf{T}_{\text{prior}}$, which embeds prior knowledge from the pretrained prior model. During testing, the prior model $\theta_{\text{prior}}$ continues to provide the prior logits for the OOD data $\mathcal{D}_{\text{out}}^{\text{test}}$ to PViT. The predicted class logits are then used to calculate the prior-guided OOD score, enabling the differentiation between ID and OOD data. Other components, including position embeddings, the classification (cls) token, and the flattening of image patches, follow the implementation of the vanilla ViT (Dosovitskiy et al., 2020).
  • Figure 3: Ablation study on different scoring rules of PViT. The ID data is IMAGENET-1K. Results are averaged over seven OOD datasets.
  • Figure 4: Score distributions with IMAGENET-1K as ID data and iNaturalist as OOD data. The scores are calculated by PViT with ViT-LP as the prior model.
  • Figure 5: Visualization of attention maps with varying scaling factors $\alpha$ for the prior token embedding, generated from the first MSA head of the last layer. The attention weight for the prior token is highlighted with a red line on the color bar. The first two rows are taken from IMAGENET-1K and represent ID data. The third row is taken from OpenImage-O and represents OOD data, illustrating how PViT responds differently to ID and OOD inputs. Labels above the original image on the left indicate predictions made by the prior model, while the labels on the right correspond to predictions made by PViT.
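
Score distributions such as those in Figure 4 are what the FPR95 and AUROC metrics summarize. The following is a minimal sketch of both metrics in plain Python, assuming that a higher score means "more in-distribution"; the source does not state the sign convention, so flip the scores if the opposite holds. The function names are illustrative and not taken from the PViT codebase.

```python
def auroc(id_scores, ood_scores):
    """Probability that a random ID sample scores higher than a random
    OOD sample (ties count as half). O(n*m), fine for small examples."""
    wins = 0.0
    for s_in in id_scores:
        for s_out in ood_scores:
            if s_in > s_out:
                wins += 1.0
            elif s_in == s_out:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted as ID at the threshold that
    keeps at least 95% of ID samples."""
    ranked = sorted(id_scores, reverse=True)
    # Smallest k with k / n >= 0.95, computed with integer arithmetic.
    k = -(-95 * len(ranked) // 100)
    threshold = ranked[k - 1]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


# Toy scores (fabricated for illustration only, not results from the paper).
toy_id = [float(i) for i in range(1, 21)]
toy_ood = [0.5, 1.5, 2.5, 30.0]
toy_fpr95 = fpr_at_95_tpr(toy_id, toy_ood)
toy_auroc = auroc(toy_id, toy_ood)
```

Lower FPR95 and higher AUROC both indicate better separation between the ID and OOD score distributions.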