SOTA: Self-adaptive Optimal Transport for Zero-Shot Classification with Multiple Foundation Models
Zhanxuan Hu, Qiyu Xu, Yu Duan, Yonghang Tai, Huafeng Li
TL;DR
SOTA is a training-free ensemble framework that integrates the outputs of multiple foundation models (VFMs or VLMs) by learning a self-adaptive transport plan. Extensive experiments across diverse domains, including natural images, medical pathology, and remote sensing, validate its generalizability.
Abstract
Foundation models have attracted widespread attention across domains due to their powerful zero-shot classification capabilities. This work is motivated by two key observations: (1) Vision-Language Models (VLMs), such as CLIP, often over-rely on class-level textual priors and struggle to capture fine-grained visual cues, whereas Vision-only Foundation Models (VFMs), such as DINO, provide rich and discriminative visual features but lack semantic alignment; (2) the performance of different VLMs varies considerably across datasets owing to differences in pre-training. To address these challenges, we propose SOTA (Self-adaptive Optimal TrAnsport), a training-free ensemble framework that integrates the outputs of multiple foundation models (VFMs or VLMs) by learning a self-adaptive transport plan. Notably, SOTA is prior-free and automatically balances model contributions. Extensive experiments across diverse domains, including natural images, medical pathology, and remote sensing, validate the generalizability of SOTA. The results consistently show that it effectively leverages the complementary strengths of different foundation models and achieves substantial improvements over individual models. The implementation code is available at: https://github.com/Afleve/self-adaptive-Optimal-Transport.
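
The abstract does not spell out the transport formulation, so the following is only a rough illustration of the general idea: how an optimal transport plan can turn the zero-shot class probabilities of several models into joint predictions. It is a minimal sketch using standard entropic optimal transport (Sinkhorn iterations), not the authors' algorithm. Everything specific here is an assumption made for illustration: the function names (`ensemble_cost`, `sinkhorn_plan`, `predict`), the fixed equal model weights, the uniform class marginal, and the toy probabilities. In particular, unlike SOTA as described above, this sketch is neither prior-free nor self-adaptive.

```python
import math


def ensemble_cost(prob_sets):
    """Cost matrix: negative mean log-probability across models.

    prob_sets is a list of per-model matrices (samples x classes).
    ASSUMPTION: every model gets the same fixed weight.
    """
    n_models = len(prob_sets)
    n, k = len(prob_sets[0]), len(prob_sets[0][0])
    return [
        [-sum(math.log(p[i][j] + 1e-12) for p in prob_sets) / n_models
         for j in range(k)]
        for i in range(n)
    ]


def sinkhorn_plan(cost, epsilon=1.0, n_iters=200):
    """Entropic OT plan between samples and classes via Sinkhorn scaling.

    ASSUMPTION: both marginals are uniform (each sample carries mass 1/n,
    each class receives mass 1/k).
    """
    n, k = len(cost), len(cost[0])
    row_mass, col_mass = 1.0 / n, 1.0 / k
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * k
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(k))
             for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]


def predict(prob_sets, epsilon=1.0):
    """Assign each sample to the class receiving most of its transported mass."""
    plan = sinkhorn_plan(ensemble_cost(prob_sets), epsilon)
    return [max(range(len(row)), key=row.__getitem__) for row in plan]


# Toy zero-shot probabilities from two hypothetical models (4 samples, 2 classes).
probs_a = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
probs_b = [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8], [0.1, 0.9]]
print(predict([probs_a, probs_b]))
```

The two toy models disagree on the third sample. The plan settles the disagreement by considering all samples at once, under the constraint that each class receives the same total mass. That coupling across samples is what distinguishes a transport-based ensemble from averaging each sample's probabilities independently.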
