
Local-Global Prompt Learning via Sparse Optimal Transport

Deniz Kizaroğlu, Ülku Tuncer Küçüktas, Emre Çakmakyurdu, Alptekin Temizel

TL;DR

SOT-GLP introduces a shared sparse patch support and balanced optimal transport allocation to explicitly partition salient visual regions among class-specific local prompts while preserving global alignment. The work also identifies a distinct accuracy-robustness trade-off in prompt learning.

Abstract

Few-shot adaptation of vision-language models (VLMs) like CLIP typically relies on learning textual prompts matched to global image embeddings. Recent works extend this paradigm by incorporating local image-text alignment to capture fine-grained visual cues, yet these approaches often select local regions independently for each prompt, leading to redundant local feature usage and prompt overlap. We propose SOT-GLP, which introduces a shared sparse patch support and balanced optimal transport allocation to explicitly partition salient visual regions among class-specific local prompts while preserving global alignment. Our method learns shared global prompts and class-specific local prompts. The global branch maintains standard image-text matching for robust category-level alignment. The local branch constructs a class-conditioned sparse patch set using V-V attention and aligns it to multiple class-specific prompts via balanced entropic optimal transport, yielding a soft partition of patches that prevents prompt overlap and collapse. We evaluate our method on two complementary objectives: (i) few-shot classification accuracy on 11 standard benchmarks and (ii) out-of-distribution (OOD) detection. On the standard 11-dataset benchmark with 16-shot ViT-B/16, SOT-GLP achieves 85.1% average accuracy, outperforming prior prompt-learning methods. We identify a distinct accuracy-robustness trade-off in prompt learning: while learnable projections optimize in-distribution fit, they alter the foundational feature space. We demonstrate that a projection-free local alignment preserves the native geometry of the CLIP manifold, yielding state-of-the-art OOD detection performance (94.2% AUC) that surpasses fully adapted models. Implementation available at: https://github.com/Deniz2304988/SOT-GLP
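
The local branch described above (a shared sparse patch set, then a balanced entropic optimal transport plan between patches and class-specific prompts) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the saliency scores standing in for V-V attention, the cosine cost, the uniform marginals, and the values of `eps` and `n_iters` are all assumptions chosen for illustration.

```python
import math

def top_k_indices(scores, k):
    """Indices of the k highest-scoring patches (hypothetical shared sparse support)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

def cosine_cost(patches, prompts):
    """Cost matrix C[i][j] = 1 - cos(patch_i, prompt_j)."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]
    P = [unit(p) for p in patches]
    T = [unit(t) for t in prompts]
    return [[1.0 - sum(a * b for a, b in zip(p, t)) for t in T] for p in P]

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Balanced entropic OT via Sinkhorn iterations with uniform marginals (assumed)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass carried by each patch
    b = [1.0 / m] * m  # mass received by each prompt
    K = [[math.exp(-c / eps) for c in row] for row in cost]  # Gibbs kernel
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy data: 6 patches with 2-D features, 2 local prompts for one class.
saliency = [0.9, 0.8, 0.1, 0.7, 0.6, 0.05]
patch_feats = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5],
               [0.0, 1.0], [0.1, 0.9], [0.4, 0.6]]
prompt_feats = [[1.0, 0.0], [0.0, 1.0]]

support = top_k_indices(saliency, k=4)
plan = sinkhorn(cosine_cost([patch_feats[i] for i in support], prompt_feats))
```

Each row of `plan` says how one selected patch splits its mass across the prompts. Because the column marginals are fixed, no single prompt can absorb every patch, which is the mechanism the abstract credits with preventing prompt overlap and collapse.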

Paper Structure (11 sections, 11 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Overview of the SOT-GLP framework. The image encoder uses two parallel streams: a standard CLIP pathway (Q-K attention) for global features and a V-V attention pathway for local patch features. Global prompts generate class-agnostic text embeddings matched to the global [CLS] token, while class-specific local prompts align to Top-$K$ selected patches via Optimal Transport. Both branches contribute to the final classification logits.
  • Figure 2: Learned Local Prompt Specialization. Patch–prompt similarity maps for two example classes. For each individual prompt, the top-3 patches are shown. The "combined prompts" column shows the similarity between the mean of the local prompts and the patch features, computed over the top-10 patches. The Optimal Transport constraint encourages the prompts to specialize on distinct visual regions, while their mean representation covers the most salient parts of the object.
  • Figure 3: Left: Avg. FPR95 vs. ImageNet accuracy. Right: Avg. AUC vs. ImageNet accuracy. Our method (blue star) sits in the high-accuracy / low-FPR95 corner (75.5, 28.1), achieving FPR95 and AUC very close to those of GalLoP but at higher ImageNet accuracy, while the variant without local projection (orange star) further lowers FPR95 to 23.8 and raises AUC to 94.2 with only a small drop in accuracy. The variant without local projection achieves state-of-the-art OOD performance among the prompt-learning baselines, with FPR95 reductions ranging from $\downarrow 3.5$ (vs. GalLoP, $27.3 \to 23.8$) to $\downarrow 4.9$ (vs. LoCoOp-GL, $28.7 \to 23.8$), and AUC gains ranging from $\uparrow 0.7$ (vs. LoCoOp (GL-MCM), $93.5 \to 94.2$) to $\uparrow 1.0$ (vs. GalLoP, $93.2 \to 94.2$).
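
Figure 3 reports FPR95 and AUC for OOD detection. As a rough illustration of how these two metrics are commonly defined, the sketch below computes both from per-sample confidence scores, assuming that a higher score means "more in-distribution". The function names and toy scores are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import math

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample (ties count half)."""
    wins = 0.0
    for s in id_scores:
        for t in ood_scores:
            if s > t:
                wins += 1.0
            elif s == t:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted at the threshold that keeps at least 95% of ID samples."""
    ranked = sorted(id_scores, reverse=True)
    k = math.ceil(0.95 * len(ranked))  # number of ID samples that must be accepted
    threshold = ranked[k - 1]
    accepted_ood = sum(1 for t in ood_scores if t >= threshold)
    return accepted_ood / len(ood_scores)

# Toy scores: one OOD sample (0.65) outranks the weakest ID sample (0.6).
id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.65, 0.3, 0.2, 0.1]
print(auroc(id_scores, ood_scores))          # 0.9375
print(fpr_at_95_tpr(id_scores, ood_scores))  # 0.25
```

Lower FPR95 and higher AUC both indicate better separation of in-distribution and OOD samples, which is why the caption describes the two metrics as moving in opposite directions.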