Equipping Vision Foundation Model with Mixture of Experts for Out-of-Distribution Detection
Shizhen Zhao, Jiahui Liu, Xin Wen, Haoru Tan, Xiaojuan Qi
TL;DR
This work systematically evaluates vision foundation models for out-of-distribution (OOD) detection and finds that a pre-trained DINOv2 provides a highly discriminative feature space even without fine-tuning. It shows that OOD performance remains unsatisfactory when the semantic space is large, even after fine-tuning, and proposes Mixture of Feature Experts (MoFE), which partitions the feature space into subspaces handled by specialized experts, together with Dynamic-$\beta$ Mixup, which adapts augmentation to category difficulty. The combination significantly outperforms baseline methods, supporting subspace specialization and adaptive augmentation as effective strategies for OOD detection with foundation models.
Abstract
Pre-trained vision foundation models have transformed many computer vision tasks. Although they learn the discriminative and generalizable features that out-of-distribution (OOD) detection depends on, their impact on this task remains underexplored. Motivated by this gap, we systematically investigate representative vision foundation models for OOD detection. Our findings reveal that a pre-trained DINOv2 model, even without fine-tuning on in-domain (ID) data, naturally provides a highly discriminative feature space for OOD detection, achieving performance comparable to existing state-of-the-art methods without requiring complex designs. Beyond this, we explore how fine-tuning foundation models on ID data can enhance OOD detection. However, we observe that the performance of vision foundation models remains unsatisfactory in scenarios with a large semantic space. This is due to the increased complexity of decision boundaries as the number of categories grows, which complicates optimization. To mitigate this, we propose the Mixture of Feature Experts (MoFE) module, which partitions features into subspaces, effectively capturing complex data distributions and refining decision boundaries. Further, we introduce a Dynamic-$\beta$ Mixup strategy, which samples interpolation weights from a dynamic beta distribution. This strategy adapts to varying levels of learning difficulty across categories, improving feature learning for more challenging categories. Extensive experiments demonstrate the effectiveness of our approach, which significantly outperforms baseline methods.
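
To make the idea of sampling interpolation weights from a dynamic beta distribution concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not specify how difficulty is measured or how it sets the distribution, so everything here is an assumption: the function names `dynamic_beta` and `mixup_pair`, the per-category `difficulty` score in [0, 1], the linear mapping to a shape parameter, the bounds `beta_min` and `beta_max`, and the direction of that mapping are all hypothetical.

```python
import random


def dynamic_beta(difficulty, beta_min=0.2, beta_max=1.0):
    """Map a per-category difficulty in [0, 1] to a Beta shape parameter.

    Hypothetical rule: a linear interpolation between beta_min and beta_max.
    The actual rule used in the paper is not given in the abstract.
    """
    d = min(max(difficulty, 0.0), 1.0)  # clamp to [0, 1]
    return beta_min + (beta_max - beta_min) * d


def mixup_pair(x_a, x_b, y_a, y_b, difficulty, rng=random):
    """Mix two samples and their one-hot labels with lam ~ Beta(b, b).

    b depends on the (assumed) difficulty of the category being mixed.
    With a symmetric Beta(b, b), a larger b concentrates lam around 0.5
    (stronger mixing); a smaller b pushes lam toward 0 or 1.
    """
    b = dynamic_beta(difficulty)
    lam = rng.betavariate(b, b)
    x = [lam * u + (1.0 - lam) * v for u, v in zip(x_a, x_b)]
    y = [lam * u + (1.0 - lam) * v for u, v in zip(y_a, y_b)]
    return x, y, lam


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    x, y, lam = mixup_pair([1.0, 0.0], [0.0, 1.0], [1, 0], [0, 1], 0.8, rng)
    print(lam, x, y)
```

In this sketch, the mixed label is a convex combination of the two one-hot labels, so its entries always sum to 1. Whether harder categories should receive stronger or weaker mixing is a design choice the abstract leaves open; the mapping above picks one direction purely for illustration.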
