Generalizing Vision-Language Models with Dedicated Prompt Guidance

Xinyao Li, Yinjie Min, Hongbo Chen, Zhekai Du, Fengling Li, Jingjing Li

TL;DR

This work addresses domain generalization for fine-tuning large vision–language models by showing that an ensemble of domain-specific, parameter-efficient experts can generalize better to unseen domains than a single universal model. It proposes GuiDG, a two-step framework that first learns domain-expert prompts and then uses a Cross-Modal Attention module to adaptively fuse these experts during CLIP fine-tuning, with training and inference designed for efficiency. A theoretical upper bound supports the intuition that partitioned experts reduce target risk, complemented by a new ImageNet-DG benchmark for few-shot DG evaluation. Empirically, GuiDG achieves state-of-the-art results across standard DG benchmarks and ImageNet-DG, while introducing only about 1M extra parameters, demonstrating both effectiveness and practicality.

Abstract

Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.

Paper Structure

This paper contains 21 sections, 3 theorems, 34 equations, 6 figures, and 8 tables.

Key Result

Theorem 1

Assume the hypothesis spaces $\mathcal{H}$, $\widetilde{\mathcal{H}}$, and $\mathcal{H}_i$ have VC-dimensions $d_0$, $\widetilde{d}$, and $d_i$, respectively. There exists a constant $C>0$ such that, for any $\delta\in(0,1)$, with probability at least $1-3d\delta$, the following inequality holds: and, denoting $N=n+m$, with probability at least $1-d\delta$, the following inequality holds:

Figures (6)

  • Figure 1: Illustration of the specialization-generalization balance. ERM fine-tuning fits to source knowledge at the cost of generalization ability, while our GuiDG achieves consistent improvements on both seen and unseen domains.
  • Figure 2: The two-step GuiDG framework. In Step 1, we split source data according to their domain characteristics. On each domain, a domain expert is learned with off-the-shelf prompt tuning methods. In Step 2, all domain experts are frozen. A Cross-Modal Attention (CMAttn) module decides ensemble weights from vision and text representations. These weights aggregate the knowledge in domain experts to guide the fine-tuning of the vision encoder, and assemble predictions for inference.
  • Figure 3: Bars show the relative accuracies of domain experts (with the average accuracy subtracted), and lines show their weights. The rightmost bar in each group shows the gain from the prompt ensemble.
  • Figure 4: Vision features of source and target data before and after fine-tuning. (a),(b) Results obtained on domain 'Art' of OfficeHome. (c),(d) Results obtained on domain 'clp' of DomainNet.
  • Figure 5: Few-shot results, averaged over all target domains.
  • ...and 1 more figure
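
The caption of Figure 2 describes Step 2 only at a high level: frozen domain experts are combined with weights computed from vision and text representations, and the weighted combination assembles the prediction. The sketch below illustrates that idea in plain Python under stated assumptions. It is not the paper's CMAttn module: the function names `expert_weights` and `ensemble_logits` are hypothetical, each expert is represented only by its per-class text features, the mean text feature of an expert is assumed to serve as its attention key, and there are no learned projections.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expert_weights(image_feat, expert_text_feats):
    """Attention-style weights over experts (illustrative assumption).

    Query: the image feature. Key of each expert: the mean of its
    per-class text features. Scores are scaled dot products.
    """
    d = len(image_feat)
    keys = [
        [sum(col) / len(text_feats) for col in zip(*text_feats)]
        for text_feats in expert_text_feats
    ]
    scores = [dot(image_feat, k) / math.sqrt(d) for k in keys]
    return softmax(scores)


def ensemble_logits(image_feat, expert_text_feats):
    """Weighted sum of per-expert cosine-similarity logits."""
    v = normalize(image_feat)
    weights = expert_weights(v, expert_text_feats)
    num_classes = len(expert_text_feats[0])
    logits = [0.0] * num_classes
    for w, text_feats in zip(weights, expert_text_feats):
        for c, t in enumerate(text_feats):
            logits[c] += w * dot(v, normalize(t))
    return weights, logits


# Toy example: two experts, two classes, 2-D features (made-up values).
image = [1.0, 0.0]
experts = [
    [[1.0, 0.0], [0.0, 1.0]],   # expert A: class-0 text feature matches the image
    [[0.0, 1.0], [-1.0, 0.0]],  # expert B: poorly aligned with the image
]
weights, logits = ensemble_logits(image, experts)
predicted_class = max(range(len(logits)), key=lambda c: logits[c])
```

In this toy setup the expert whose text features lie closer to the image receives the larger weight, so its prediction dominates the ensemble. A trainable module would replace the fixed mean-feature keys with learned projections.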

Theorems & Definitions (6)

  • Theorem 1
  • Corollary 2
  • Remark 3
  • Lemma 4
  • Proof of Theorem 1
  • Proof of Corollary 2