Equipping Vision Foundation Model with Mixture of Experts for Out-of-Distribution Detection

Shizhen Zhao, Jiahui Liu, Xin Wen, Haoru Tan, Xiaojuan Qi

TL;DR

This work systematically evaluates vision foundation models for out-of-distribution (OOD) detection and finds that DINOv2 provides a highly discriminative feature space even without fine-tuning. It observes that OOD performance remains unsatisfactory when the semantic space is large, and proposes Mixture of Feature Experts (MoFE) to partition the feature space into subspaces with specialized experts, complemented by Dynamic-$\beta$ Mixup to adapt augmentation to category difficulty. The combination yields significant gains over strong baselines on ImageNet-1K/100 and diverse OOD benchmarks, supporting subspace specialization and adaptive augmentation as effective strategies for OOD detection with foundation models.

Abstract

Pre-trained vision foundation models have transformed many computer vision tasks. Despite their strong ability to learn discriminative and generalizable features crucial for out-of-distribution (OOD) detection, their impact on this task remains underexplored. Motivated by this gap, we systematically investigate representative vision foundation models for OOD detection. Our findings reveal that a pre-trained DINOv2 model, even without fine-tuning on in-domain (ID) data, naturally provides a highly discriminative feature space for OOD detection, achieving performance comparable to existing state-of-the-art methods without requiring complex designs. Beyond this, we explore how fine-tuning foundation models on ID data can enhance OOD detection. However, we observe that the performance of vision foundation models remains unsatisfactory in scenarios with a large semantic space. This is due to the increased complexity of decision boundaries as the number of categories grows, which complicates the optimization process. To mitigate this, we propose the Mixture of Feature Experts (MoFE) module, which partitions features into subspaces, effectively capturing complex data distributions and refining decision boundaries. Further, we introduce a Dynamic-$\beta$ Mixup strategy, which samples interpolation weights from a dynamic beta distribution. This adapts to varying levels of learning difficulty across categories, improving feature learning for more challenging categories. Extensive experiments demonstrate the effectiveness of our approach, which significantly outperforms baseline methods.
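
The idea of sampling Mixup weights from a difficulty-dependent beta distribution can be illustrated with a minimal NumPy sketch. The abstract does not give the actual schedule, so everything specific here is an assumption made for illustration: the function name `dynamic_beta_mixup`, the per-class `class_difficulty` scores in [0, 1], the bounds `alpha_easy` and `alpha_hard`, and the linear mapping from difficulty to the Beta parameter. It should not be read as the authors' formulation.

```python
import numpy as np


def dynamic_beta_mixup(features, labels, class_difficulty, num_classes,
                       alpha_easy=0.2, alpha_hard=1.0, rng=None):
    """Mix each sample with a random partner using a per-sample Beta weight.

    ASSUMPTION: the Beta parameter is interpolated linearly between
    alpha_easy and alpha_hard according to the difficulty score of the
    sample's class. The paper's actual rule is not stated in the abstract.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = features.shape[0]

    # Pair every sample with another sample from the same batch.
    partner = rng.permutation(n)

    # Look up one difficulty score per sample, then derive its Beta parameter.
    difficulty = class_difficulty[labels]                      # shape (n,)
    alpha = alpha_easy + (alpha_hard - alpha_easy) * difficulty

    # Symmetric Beta(alpha, alpha): small alpha pushes weights toward 0 or 1
    # (weak mixing), alpha = 1 gives uniform weights (stronger mixing).
    lam = rng.beta(alpha, alpha)                               # shape (n,)

    # Broadcast the weight over all non-batch feature dimensions.
    lam_feat = lam.reshape(-1, *([1] * (features.ndim - 1)))
    mixed = lam_feat * features + (1.0 - lam_feat) * features[partner]

    # Mix the one-hot targets with the same weights.
    one_hot = np.eye(num_classes)[labels]
    mixed_targets = lam[:, None] * one_hot + (1.0 - lam[:, None]) * one_hot[partner]
    return mixed, mixed_targets, lam
```

Under these assumptions, classes with a higher difficulty score receive interpolation weights drawn from a flatter distribution and are therefore mixed more strongly; how difficulty is measured during training is left open here.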

Paper Structure

This paper contains 23 sections, 12 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Holistic comparison to previous philosophy. (a) Traditional methods use a generalized model to project inputs onto a complex distribution; (b) our approach leverages multiple experts to break the complex distribution into smaller ones, which leads to a compact ID distribution and a simplified decision boundary.
  • Figure 2: Performance of vision foundation models across different OOD splits. The evaluation metric is FPR95, with lower values indicating better performance.
  • Figure 3: Feature visualization for foundation models. For fine-grained feature visualization, we randomly select fine-grained categories under three different superclasses from ImageNet-1K.
  • Figure 4: Illustration of our proposed Mixture of Feature Experts (MoFE). MoFE decomposes the large semantic space into multiple subspaces and each expert specializes in a specific subspace. Specifically, the image patches and the class token are input to obtain the preliminary patch embeddings and class embedding. A router is employed to determine the expert to further process the embeddings, and the input of the router is the class embedding. Finally, we apply associated experts to refine the class embeddings and the patch embeddings. We use the class embeddings output by MoFE and conduct the OOD detection in the corresponding subspace.
  • Figure 5: Visualization of the feature space of MoFE and MOS. It can be observed that, when trained with MOS, the outlier features are still mingled with in-domain data, while MoFE separates the in- and out-of-distribution data well.
  • ...and 2 more figures
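
The routing flow described in the Figure 4 caption (a router reads the class embedding, selects an expert, and that expert refines both the class and patch embeddings) can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the class name `MoFESketch`, top-1 routing via `argmax`, the use of a single residual linear layer per expert, and the random weights are all assumptions, since the caption does not specify the expert architecture or the routing rule.

```python
import numpy as np


class MoFESketch:
    """Hypothetical top-1 mixture-of-experts layer over ViT-style embeddings."""

    def __init__(self, dim, num_experts, seed=0):
        rng = np.random.default_rng(seed)
        # Router: linear map from the class embedding to one logit per expert.
        self.router_w = rng.normal(scale=0.02, size=(dim, num_experts))
        # ASSUMPTION: each expert is one linear layer applied residually.
        self.expert_w = rng.normal(scale=0.02, size=(num_experts, dim, dim))

    def route(self, cls_emb):
        # cls_emb: (batch, dim) -> index of the selected expert per image.
        return np.argmax(cls_emb @ self.router_w, axis=1)

    def forward(self, cls_emb, patch_emb):
        # patch_emb: (batch, num_patches, dim)
        expert_idx = self.route(cls_emb)
        w = self.expert_w[expert_idx]  # (batch, dim, dim), one expert per image

        # The same selected expert refines the class and patch embeddings.
        refined_cls = cls_emb + np.einsum("bd,bde->be", cls_emb, w)
        refined_patch = patch_emb + np.einsum("bpd,bde->bpe", patch_emb, w)
        return refined_cls, refined_patch, expert_idx


# Usage: 5 images, 9 patches each, 16-dimensional embeddings, 4 experts.
model = MoFESketch(dim=16, num_experts=4)
data_rng = np.random.default_rng(1)
cls_tokens = data_rng.normal(size=(5, 16))
patch_tokens = data_rng.normal(size=(5, 9, 16))
refined_cls, refined_patch, chosen = model.forward(cls_tokens, patch_tokens)
```

In this sketch, `chosen` identifies the subspace each image was assigned to, which is where the caption says OOD detection is then carried out using the refined class embedding.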