A Retrospect to Multi-prompt Learning across Vision and Language

Ziliang Chen, Xin Huang, Quanlong Guan, Liang Lin, Weiqi Luo

TL;DR

This work analyzes why multi-prompt learning can outperform single-prompt approaches in vision–language models and proposes EMPL, an energy-based framework that samples multiple prompts conditioned on image features to balance in-domain accuracy with open-vocabulary generalization. By modeling prompts with an energy-based distribution and optimizing a meta-learning objective, EMPL narrows the cross-modal modality gap and mitigates cross-domain and cross-dataset generalization challenges. The authors provide both theoretical justification (modality gap and non-identifiability) and extensive empirical validation across base-to-new, cross-domain, and cross-dataset tasks, showing consistent gains without adding parameters to CLIP-style backbones. The approach offers a practical, scalable path to robust open-vocabulary vision–language understanding, with notable improvements in few-shot retrieval and transfer scenarios, at the price of a higher inference cost that invites future efficiency-focused work.

Abstract

The vision community is undergoing unprecedented progress with the emergence of Vision-Language Pretraining Models (VLMs). Prompt learning is regarded as the holy grail of accessing VLMs, since it enables their fast adaptation to downstream tasks with limited resources. However, existing research revolves around single-prompt paradigms and rarely investigates the technical potential of their multi-prompt learning counterparts. This paper aims to provide a principled retrospect of vision-language multi-prompt learning. We extend the recent constant modality gap phenomenon to learnable prompts and then justify, empirically and theoretically, the superiority of vision-language transfer with multi-prompt augmentation. Based on this observation, we propose Energy-based Multi-prompt Learning (EMPL), which generates multiple prompt embeddings by drawing instances from an energy-based distribution that is implicitly defined by VLMs. EMPL is therefore not only parameter-efficient but also rigorously leads to a balance between in-domain and out-of-domain open-vocabulary generalization. Comprehensive experiments have been conducted to justify our claims and the excellence of EMPL.
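
To make the idea of drawing prompt embeddings from an energy-based distribution more concrete, the following is a minimal sketch of Langevin-style (SGLD) sampling. It is not the authors' implementation: the quadratic toy energy, the function names (`energy`, `energy_grad`, `sgld_sample_prompts`), and the `temperature` parameter are all assumptions introduced for illustration, whereas in EMPL the energy is implicitly defined by the VLM.

```python
import math
import random


def energy(prompt, image_feat):
    """Toy energy (an assumption, not the paper's energy):
    low when the prompt embedding aligns with the image feature."""
    similarity = sum(p * x for p, x in zip(prompt, image_feat))
    regulariser = 0.5 * sum(p * p for p in prompt)
    return -similarity + regulariser


def energy_grad(prompt, image_feat):
    """Gradient of the toy energy with respect to the prompt embedding."""
    return [p - x for p, x in zip(prompt, image_feat)]


def sgld_sample_prompts(image_feat, num_prompts=4, steps=50,
                        step_size=0.1, temperature=1.0, seed=0):
    """Draw several prompt embeddings with Langevin updates:
    p <- p - (step_size / 2) * grad E(p) + sqrt(step_size * temperature) * noise.
    With temperature=0 this reduces to plain gradient descent on the energy."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(step_size * temperature)
    prompts = []
    for _ in range(num_prompts):
        # Each chain starts from an independent Gaussian initialisation.
        p = [rng.gauss(0.0, 1.0) for _ in image_feat]
        for _ in range(steps):
            g = energy_grad(p, image_feat)
            p = [pi - 0.5 * step_size * gi + noise_scale * rng.gauss(0.0, 1.0)
                 for pi, gi in zip(p, g)]
        prompts.append(p)
    return prompts


if __name__ == "__main__":
    image_feat = [1.0, -2.0, 0.5]  # hypothetical image feature
    for prompt in sgld_sample_prompts(image_feat):
        print([round(v, 3) for v in prompt], round(energy(prompt, image_feat), 3))
```

Under this toy energy the samples scatter around the image feature, which loosely mirrors the idea of conditioning multiple prompts on the image; the real energy landscape and sampler settings would differ.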

Paper Structure

This paper contains 14 sections, 3 theorems, 9 equations, 7 figures, 5 tables.

Key Result

Proposition 1

Individual-level cross-modal non-identifiability (Informal). Suppose a single-prompt learning model $(f(\cdot),h_{\boldsymbol{v}}(\cdot))$ satisfies the constant individual-level modality gap. Given each pair of images $\boldsymbol{x}_1$, $\boldsymbol{x}_2$ with mutually exclusive concepts, it is not

Figures (7)

  • Figure 1: Overview of cross-modal single-prompt learning and multi-prompt learning (MPL). With more prompt templates, MPL brings new opportunities and challenges that have been discussed in the community, yet it has seldom received a systematic investigation or solution.
  • Figure 2: Magnitude (M) and Direction (D) of the individual modality gap (IMG) and class modality gap (CMG) on MSCOCO (Lin et al., 2014). We gradually increase the number of prompts by switching models, CLIP$\rightarrow$CoOp$\rightarrow$ProDA$\rightarrow$ProDA(x2)$\rightarrow$ProDA(x4), and observe how IMG and CMG change.
  • Figure 3: The paradigm of EMPL (best viewed in color). In brief, EMPL defines a prompt distribution based on an energy-based model (EBM) whose variables lie in the image feature and prompt embedding spaces. It classifies an image with multiple prompts iteratively drawn from the EBM-based distribution by SGLD samplers.
  • Figure 4: Performance change on base classes across 11 datasets.
  • Figure 5: Performance change on new classes across 11 datasets.
  • ...and 2 more figures
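
As a rough illustration of what the magnitude and direction of a modality gap (Figure 2) could measure, here is a minimal sketch. It assumes, purely for illustration, that the gap of an image-text pair is the difference between its image and text embeddings, and that direction is compared through cosine similarity with the mean gap; the paper's exact definitions of IMG and CMG may differ, and all function names are hypothetical.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def gap_statistics(image_embs, text_embs):
    """Assumed definition: gap_i = image_emb_i - text_emb_i for each pair.
    Returns per-pair gap magnitudes and cosine similarities to the mean gap.
    A (near-)constant gap shows up as equal magnitudes and cosines close to 1."""
    gaps = [[a - b for a, b in zip(x, y)]
            for x, y in zip(image_embs, text_embs)]
    mean_gap = [sum(col) / len(gaps) for col in zip(*gaps)]
    magnitudes = [norm(g) for g in gaps]
    directions = [cosine(g, mean_gap) for g in gaps]
    return magnitudes, directions


if __name__ == "__main__":
    # Hypothetical 2-D embeddings in which every pair shares the same gap.
    text = [[0.0, 0.0], [1.0, 1.0]]
    image = [[1.0, 0.0], [2.0, 1.0]]
    print(gap_statistics(image, text))
```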

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 2
  • Proposition 3