MOODv2: Masked Image Modeling for Out-of-Distribution Detection

Jingyao Li, Pengguang Chen, Shaozuo Yu, Shu Liu, Jiaya Jia

TL;DR

This work tackles OOD detection by prioritizing high-quality in-distribution representations over purely recognition-based cues. It advocates reconstruction-based pretraining through masked image modeling (MIM) to learn pixel-level ID features and employs a ViM score that fuses features and logits for robust OOD scoring. MOODv2 extends the prior MOOD framework with newer pretraining methods and a broader set of OOD scores, achieving substantial gains (e.g., 95.68% AUROC on ImageNet and 99.98% on CIFAR-10) and reducing the gap between simple and complex scores. The approach demonstrates that reconstruction-based pretext tasks yield strong, dataset-agnostic ID representations, offering practical benefits for safe and reliable visual recognition in real-world applications.

Abstract

The crux of effective out-of-distribution (OOD) detection lies in acquiring a robust in-distribution (ID) representation that is distinct from OOD samples. Previous methods relied predominantly on recognition-based techniques for this purpose, which often led to shortcut learning and incomplete representations. In our study, we conduct a comprehensive analysis, exploring distinct pretraining tasks and employing various OOD score functions. The results show that feature representations pretrained through reconstruction yield a notable improvement and narrow the performance gap among score functions. This suggests that even simple score functions can rival complex ones when built on reconstruction-based pretext tasks. Because such pretext tasks adapt well to various score functions, they hold promising potential for further expansion. Our OOD detection framework, MOODv2, employs the masked image modeling pretext task. Without bells and whistles, MOODv2 improves AUROC by 14.30% to 95.68% on ImageNet and achieves 99.98% on CIFAR-10.
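To make "simple score functions" and "AUROC" concrete, the following is a minimal sketch, not the authors' implementation. It assumes two widely used logit-based scores (maximum softmax probability and the energy score) as examples of simple score functions, and computes AUROC as the fraction of ID/OOD pairs ranked correctly. The function names and the toy logits are hypothetical and are not taken from the paper.

```python
import math

def msp_score(logits):
    """Maximum softmax probability; higher means more ID-like."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)

def energy_score(logits):
    """Log-sum-exp of the logits (negative free energy); higher means more ID-like."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample.

    Ties count as half a win. This pairwise form is quadratic in the number
    of samples, which is fine for a small illustration.
    """
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

# Toy logits, made up for illustration only (assumption, not paper data).
id_logits = [[6.0, 0.5, 0.1], [4.0, 1.0, 0.2], [5.0, 0.3, 0.3]]
ood_logits = [[1.0, 0.9, 0.8], [0.5, 0.6, 0.4], [4.5, 4.4, 0.1]]

msp_auroc = auroc([msp_score(z) for z in id_logits],
                  [msp_score(z) for z in ood_logits])
energy_auroc = auroc([energy_score(z) for z in id_logits],
                     [energy_score(z) for z in ood_logits])
print(f"MSP AUROC: {msp_auroc:.3f}, energy AUROC: {energy_auroc:.3f}")
```

On these toy logits the two scores rank the samples differently, which is the kind of gap between score functions that the abstract reports narrowing under reconstruction-based pretraining.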

Paper Structure (18 sections, 1 equation, 7 figures, 9 tables, 1 algorithm)

Figures (7)

  • Figure 1: The average AUROC (%) over four OOD datasets for a ViT model with different pretext tasks. Methods in blue use the feature space; methods in green use logits; methods in yellow use the softmax probability; and methods in red use both features and logits. The stars show the average performance of each category of methods.
  • Figure 2: Comparison of reconstruction-based and classification-based methods. In the context of image classification, networks often take a shortcut when categorizing images [backdoor_attack, frog_attack]. For example, ears are a distinctive feature for distinguishing between cats and dogs, and a classification model typically assumes that animals with pointed ears are cats, while those without are dogs. Consequently, when the network encounters an out-of-distribution animal, such as a fox with pointed ears, it readily misclassifies it as a cat. In contrast, reconstruction-based tasks effectively mitigate this issue. By randomly masking portions of images, the model avoids learning localized, stereotypical features (e.g., masked ears), thus preventing shortcuts and instead acquiring effective pixel-level representations for ID data. This significantly improves the model's ability to detect OOD instances.
  • Figure 3: The AUROC (%) of MOODv2 and MOODv1 tested on four OOD datasets: OpenImage-O [openimages_o], Texture [dtd], iNaturalist [inaturalist], and ImageNet-O [imagenet_o].
  • Figure 4: Each image pair consists of the original image (left) and the reconstructed image (right). The rows of images are sourced from ImageNet [imagenet], Texture [cimpoi14describing], iNaturalist [van2018inaturalist], ImageNet-O [hendrycks2021natural], and OpenImage-O [openimages_o]. The number in the top left corner of each image pair is the Euclidean distance between the two images.
  • Figure 5: The AUROC (%) on unnatural OOD datasets for various OOD detection algorithms applied to a ViT model. The pretext tasks include the classification task [vit], the contrastive learning tasks MoCov3 [mocov3] and DINOv2 [dinov2], and the masked image modeling tasks of the BEiT series [beit, beitv2]. Methods in blue use the feature space; methods in green use logits; methods in yellow use the softmax probability; and methods in red use both features and logits. Stars represent the average AUROC for methods of the corresponding color; light vertical lines represent the standard deviation.
  • ...and 2 more figures
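
The captions for Figures 2 and 4 mention two operations that are easy to sketch: randomly masking image patches, and measuring the Euclidean distance between an image and its reconstruction. The following is a rough illustration under stated assumptions, not the paper's pipeline. Uniform random patch selection is a simplification, the 14x14 patch grid and 0.4 mask ratio are hypothetical values, and images are represented as flat lists of pixel intensities. The function names are invented for this example.

```python
import math
import random

def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Return a list of booleans; True marks a patch that is hidden.

    Uniform sampling without replacement is an assumption made here
    for simplicity.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]

def euclidean_distance(image_a, image_b):
    """L2 distance between two images given as equal-length flat pixel lists."""
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of pixels")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(image_a, image_b)))

# Hypothetical setup: a 14x14 patch grid with 40% of patches masked.
mask = random_patch_mask(num_patches=196, mask_ratio=0.4)
print(f"masked patches: {sum(mask)} of {len(mask)}")

# Toy four-pixel "images" (made up); the reconstruction misses one pixel.
original = [0.0, 0.5, 1.0, 0.25]
reconstructed = [0.0, 0.5, 0.0, 0.25]
print(f"reconstruction distance: {euclidean_distance(original, reconstructed)}")
```

A larger distance means the reconstruction deviates more from the input, which is the quantity shown in the corner of each image pair in Figure 4.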