Generalizing Vision-Language Models with Dedicated Prompt Guidance

Xinyao Li; Yinjie Min; Hongbo Chen; Zhekai Du; Fengling Li; Jingjing Li

TL;DR

This work addresses domain generalization for fine-tuning large vision–language models by showing that an ensemble of domain-specific, parameter-efficient experts can generalize better to unseen domains than a single universal model. It proposes GuiDG, a two-step framework that first learns domain-expert prompts and then uses a Cross-Modal Attention module to adaptively fuse these experts during CLIP fine-tuning, with training and inference designed for efficiency. A theoretical upper bound supports the intuition that partitioned experts reduce target risk, complemented by a new ImageNet-DG benchmark for few-shot DG evaluation. Empirically, GuiDG achieves state-of-the-art results across standard DG benchmarks and ImageNet-DG, while introducing only about 1M extra parameters, demonstrating both effectiveness and practicality.

Abstract

Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.
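The adaptive expert integration described above can be illustrated with a small sketch. This is not the paper's Cross-Modal Attention module, whose exact design is not given here. It is a simplified stand-in under stated assumptions: each domain expert is represented by the class text embeddings its tuned prompt would produce, the image feature acts as the attention query, and each expert's key is the mean of its class embeddings. The function name `fuse_experts`, the tensor shapes, and the `temperature` value are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1, eps=1e-12):
    # Unit-normalize so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def fuse_experts(image_feat, expert_text_feats, temperature=0.07):
    """Attention-weighted fusion of per-domain expert predictions.

    image_feat:        (B, D) image embeddings (assumed to come from a vision encoder)
    expert_text_feats: (K, C, D) class text embeddings for K domain experts
    Returns fused logits (B, C) and per-image expert weights (B, K).
    """
    img = l2_normalize(image_feat)
    txt = l2_normalize(expert_text_feats)

    # Cosine-similarity logits of every image under every expert: (B, K, C).
    per_expert_logits = np.einsum("bd,kcd->bkc", img, txt)

    # Assumption: summarize each expert by the mean of its class embeddings.
    keys = l2_normalize(txt.mean(axis=1))          # (K, D)

    # The image is the query; attention weights over experts sum to 1 per image.
    weights = softmax(img @ keys.T / temperature)  # (B, K)

    # Convex combination of the expert logits.
    fused_logits = np.einsum("bk,bkc->bc", weights, per_expert_logits)
    return fused_logits, weights


# Toy usage with random features: 4 images, 3 experts, 5 classes, 16 dimensions.
rng = np.random.default_rng(0)
image_feat = rng.normal(size=(4, 16))
expert_text_feats = rng.normal(size=(3, 5, 16))
logits, weights = fuse_experts(image_feat, expert_text_feats)
```

Because the weights form a convex combination, an image that resembles one source domain leans mostly on that domain's expert, while an image from an unseen domain can blend several experts. That is the intuition behind using an ensemble of experts rather than a single universal model.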
