SOTA: Self-adaptive Optimal Transport for Zero-Shot Classification with Multiple Foundation Models

Zhanxuan Hu, Qiyu Xu, Yu Duan, Yonghang Tai, Huafeng Li

TL;DR

SOTA is a training-free ensemble framework that integrates the outputs of multiple foundation models, either vision-only foundation models (VFMs) or vision-language models (VLMs), by learning a self-adaptive transport plan. Extensive experiments across diverse domains, including natural images, medical pathology, and remote sensing, validate its generalizability.

Abstract

Foundation models have attracted widespread attention across domains due to their powerful zero-shot classification capabilities. This work is motivated by two key observations: (1) Vision-Language Models (VLMs), such as CLIP, often over-rely on class-level textual priors and struggle to capture fine-grained visual cues, whereas Vision-only Foundation Models (VFMs), such as DINO, provide rich and discriminative visual features but lack semantic alignment; (2) the performance of different VLMs varies considerably across datasets owing to differences in pre-training. To address these challenges, we propose SOTA (Self-adaptive Optimal TrAnsport), a training-free ensemble framework that integrates the outputs of multiple foundation models (VFMs or VLMs) by learning a self-adaptive transport plan. Notably, SOTA is prior-free and automatically balances model contributions. Extensive experiments across diverse domains, including natural images, medical pathology, and remote sensing, validate the generalizability of SOTA. The results consistently show that it effectively leverages the complementary strengths of different foundation models and achieves substantial improvements over individual models. The implementation code is available at: https://github.com/Afleve/self-adaptive-Optimal-Transport.

Paper Structure

This paper contains 39 sections, 1 theorem, 13 equations, 7 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

Let $f(x) = x^2$, which is a convex function, and let $x^{(k)} \in \mathbb{R}$ be a given point. Then the first-order Taylor expansion of $f(x)$ at $x^{(k)}$ yields the global lower bound $f(x) \geq 2x^{(k)}x - (x^{(k)})^2$ for all $x \in \mathbb{R}$, with equality if and only if $x = x^{(k)}$. Moreover, this affine function serves as a valid minorizer of $f(x)$ that can be maximized in iterative optimization schemes.
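
The bound in Theorem 1 can be checked numerically. For $f(x) = x^2$, the tangent at $x^{(k)}$ is $g(x) = (x^{(k)})^2 + 2x^{(k)}(x - x^{(k)})$, and the gap $f(x) - g(x)$ equals $(x - x^{(k)})^2$. The following is a minimal sketch; the function names, the expansion point `x_k = 1.5`, and the grid are illustrative assumptions, not values from the paper.

```python
def f(x):
    """The convex function from Theorem 1."""
    return x * x


def tangent_minorizer(x, x_k):
    """First-order Taylor expansion of f at x_k: f(x_k) + f'(x_k) * (x - x_k)."""
    return f(x_k) + 2.0 * x_k * (x - x_k)


# Hypothetical expansion point and grid, chosen as multiples of 0.5 so that
# every floating-point operation below is exact.
x_k = 1.5
grid = [0.5 * i for i in range(-10, 11)]

# The gap f(x) - g(x) simplifies to (x - x_k)^2, so it is never negative.
gaps = [f(x) - tangent_minorizer(x, x_k) for x in grid]

# The tangent never exceeds f on the grid.
assert all(gap >= 0.0 for gap in gaps)
# Equality holds only at the expansion point.
assert [x for x, gap in zip(grid, gaps) if gap == 0.0] == [x_k]
```

Because the tangent is affine in $x$, replacing $x^2$ by it inside a larger objective gives a surrogate that is linear in that term, which is what makes such minorizers convenient in iterative schemes.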

Figures (7)

  • Figure 1: Clustering accuracy comparison of visual features extracted from different foundation models. Compared with VLMs, VFMs produce more discriminative representations, especially on fine-grained datasets such as StanfordCars, Flower102, and Pets.
  • Figure 2: Top-1 accuracy of SOTA in zero-shot classification across three domains. The performance of individual VLMs varies notably across datasets, while SOTA effectively exploits their complementary strengths, yielding substantial improvements.
  • Figure 3: Pipeline of our method SOTA. We adopt a self-adaptive optimal transport strategy to integrate the outputs of diverse foundation models (VFMs or VLMs), yielding a transport plan $\mathbf{T}$. In the transductive setting, $\mathbf{T}$ directly serves as the final prediction. In the inductive setting, $\mathbf{T}$ guides the estimation of GMM parameters $\Theta$, which form one or multiple visual classifiers that collaborate with the text classifier to produce predictions on unseen test data.
  • Figure 4: Ablation experiments showing top-1 accuracy (%) across three dataset types: each group consists of three bars (from left to right: natural, remote sensing, and medical images). The bars correspond to the following settings: Base, Only-$\hat{\mathbf{P}}_v$, Non Self-adaptive, Disjoint-learning, and SOTA.
  • Figure 5: Convergence curves on 11 natural datasets. Our method often converges within a few iterations.
  • ...and 2 more figures
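
To make the transductive step in Figure 3 more concrete, the sketch below fuses the class probabilities of several models with plain entropic optimal transport (Sinkhorn iterations) and reads predictions off the resulting plan. It is not the paper's self-adaptive scheme. The function name `sinkhorn_ensemble`, the cost (negative mean log-probability with equal model weights), the regularization `eps`, and the fixed uniform class marginal are all assumptions of this illustration; the source describes SOTA as prior-free and as balancing model contributions automatically, which this simplified version does not do.

```python
import numpy as np


def sinkhorn_ensemble(probs_list, eps=1.0, n_iters=200):
    """Fuse per-model class probabilities with plain entropic OT.

    probs_list: list of (N, K) arrays whose rows are probability vectors.
    Returns an (N, K) transport plan with total mass 1.
    """
    # Assumption: cost is the negative mean log-probability across models,
    # i.e. every model gets the same fixed weight.
    log_p = np.mean([np.log(np.clip(p, 1e-12, None)) for p in probs_list], axis=0)
    cost = -log_p
    n, k = cost.shape

    a = np.full(n, 1.0 / n)  # each sample carries equal mass
    b = np.full(k, 1.0 / k)  # assumption: fixed uniform class marginal

    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(k)
    for _ in range(n_iters):
        u = a / (kernel @ v)    # rescale rows toward marginal a
        v = b / (kernel.T @ u)  # rescale columns toward marginal b
    return u[:, None] * kernel * v[None, :]


# Toy outputs of two hypothetical models on 4 samples and 2 classes.
model_a = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]])
model_b = np.array([[0.6, 0.4], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9]])

plan = sinkhorn_ensemble([model_a, model_b])
predictions = plan.argmax(axis=1)  # the plan itself serves as the prediction
```

In this toy case both models agree on every sample, so the plan only rebalances mass across classes. The interesting behavior of an ensemble appears when the models disagree, which is the regime the paper's adaptive weighting targets.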

Theorems & Definitions (1)

  • Theorem 1