Evolving Prompt Adaptation for Vision-Language Models

Enming Zhang, Jiayang Li, Yanru Wu, Zhenyu Liu, Yang Li

TL;DR

EvoPrompt is a novel framework designed to explicitly steer the prompt trajectory for stable, knowledge-preserving fine-tuning, and achieves state-of-the-art performance in few-shot learning while robustly preserving the original zero-shot capabilities of pre-trained VLMs.

Abstract

The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge. While parameter-efficient prompt learning methods offer a promising path, they often suffer from catastrophic forgetting of pre-trained knowledge. Toward addressing this limitation, our work is grounded in the insight that governing the evolutionary path of prompts is essential for forgetting-free adaptation. To this end, we propose EvoPrompt, a novel framework designed to explicitly steer the prompt trajectory for stable, knowledge-preserving fine-tuning. Specifically, our approach employs a Modality-Shared Prompt Projector (MPP) to generate hierarchical prompts from a unified embedding space. Critically, an evolutionary training strategy decouples low-rank updates into directional and magnitude components, preserving early-learned semantic directions while only adapting their magnitude, thus enabling prompts to evolve without discarding foundational knowledge. This process is further stabilized by Feature Geometric Regularization (FGR), which enforces feature decorrelation to prevent representation collapse. Extensive experiments demonstrate that EvoPrompt achieves state-of-the-art performance in few-shot learning while robustly preserving the original zero-shot capabilities of pre-trained VLMs.

Paper Structure

This paper contains 33 sections, 1 theorem, 14 equations, 4 figures, and 4 tables.

Key Result

Theorem 1 (Soft-HGR Maximal Correlation)

Let $\mathbf{u} \in \mathbb{R}^d$ and $\mathbf{v} \in \mathbb{R}^d$ be zero-mean random vectors. The Soft-HGR maximal correlation between $\mathbf{u}$ and $\mathbf{v}$ is the solution to the following optimization problem: subject to $\mathbb{E}[\phi(\mathbf{u})]=\mathbb{E}[\psi(\mathbf{v})]=\mathbf{0}$, where $\phi, \psi$ are measurable functions.
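The objective of the optimization problem is not shown in the statement above. For orientation only, the Soft-HGR objective as it is commonly written in the literature is given below; treating this as the omitted expression is an assumption, and the paper's exact notation or scaling may differ.

$$
\max_{\phi,\,\psi}\;\; \mathbb{E}\!\left[\phi(\mathbf{u})^{\top}\psi(\mathbf{v})\right] \;-\; \frac{1}{2}\,\operatorname{tr}\!\left(\operatorname{cov}\!\left(\phi(\mathbf{u})\right)\operatorname{cov}\!\left(\psi(\mathbf{v})\right)\right)
$$

Under this form, the first term rewards alignment between the two feature maps, while the trace term replaces the hard whitening constraint of the classical HGR maximal correlation with a soft penalty on the feature covariances.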

Figures (4)

  • Figure 1: Comparison of the proposed EvoPrompt framework with representative efficient transfer learning methods for VLMs.
  • Figure 2: Overview of the proposed EvoPrompt framework. Left: Modality-shared projectors are used to inject prompts into dual encoders. Top-right: To enhance feature orthogonality, $\mathcal{L}_{fgr}$ transforms correlated representations into mutually independent vectors. Bottom-right: The low-rank adapter is decomposed into magnitude $\alpha_i$ and direction components, with historical directions frozen to preserve early geometric alignments, while the magnitudes remain trainable for later adaptation.
  • Figure 3: EvoPrompt performance comparison in the few-shot image recognition setting.
  • Figure 4: Analysis of training dynamics and performance. (a) The evolution of learnable magnitudes $\alpha_i$. (b, c) Performance comparison between MaPLe and EvoPrompt, where vertical dashed lines indicate training breakpoints.
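
The magnitude and direction split described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is one plausible reading, not the authors' implementation: it assumes the low-rank update is a product `B @ A`, that each rank-one component is normalized by its Frobenius norm to give a direction, and that the resulting scalar plays the role of $\alpha_i$. The function names `decompose_low_rank` and `recompose` are hypothetical.

```python
import numpy as np


def decompose_low_rank(B, A):
    """Split a rank-r update B @ A into unit-norm directions and magnitudes.

    Assumption: each rank-one term outer(B[:, i], A[i, :]) is one "direction"
    once divided by its Frobenius norm; the norm is the magnitude alpha_i.
    """
    directions, magnitudes = [], []
    for i in range(B.shape[1]):
        component = np.outer(B[:, i], A[i, :])
        norm = np.linalg.norm(component)  # Frobenius norm for a 2-D array
        directions.append(component / norm)
        magnitudes.append(norm)
    return np.stack(directions), np.array(magnitudes)


def recompose(directions, magnitudes):
    """Rebuild the update as sum_i alpha_i * D_i."""
    return np.tensordot(magnitudes, directions, axes=1)


rng = np.random.default_rng(0)
B = rng.normal(size=(6, 2))
A = rng.normal(size=(2, 5))

# After an early training phase: freeze the directions, keep the magnitudes.
frozen_directions, alphas = decompose_low_rank(B, A)

# Later adaptation only rescales along the frozen directions.
adapted_update = recompose(frozen_directions, alphas * np.array([2.0, 0.5]))
```

In this sketch, later training would update only `alphas`, so the adapted update always stays in the span of the frozen directions, which is the sense in which early-learned directions are preserved.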

Theorems & Definitions (1)

  • Theorem 1: Soft-HGR Maximal Correlation
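
To make the Soft-HGR quantity concrete, the sketch below estimates it from two batches of features. It assumes the commonly cited form of the objective: the expected inner product of the two feature maps minus half the trace of the product of their covariances. How the paper turns this into $\mathcal{L}_{fgr}$ is not shown here, so the function name `soft_hgr_objective` and the use of the unbiased `n - 1` normalization are illustrative assumptions.

```python
import numpy as np


def soft_hgr_objective(f, g):
    """Empirical Soft-HGR objective for feature batches f, g of shape (n, k).

    Assumed form: E[f^T g] - 0.5 * tr(cov(f) @ cov(g)), with features
    centered over the batch to satisfy the zero-mean constraint.
    """
    f = f - f.mean(axis=0, keepdims=True)
    g = g - g.mean(axis=0, keepdims=True)
    n = f.shape[0]
    inner = np.sum(f * g) / (n - 1)   # estimate of E[f^T g]
    cov_f = f.T @ f / (n - 1)         # (k, k) covariance of f
    cov_g = g.T @ g / (n - 1)         # (k, k) covariance of g
    return inner - 0.5 * np.trace(cov_f @ cov_g)


rng = np.random.default_rng(0)
image_features = rng.normal(size=(128, 4))
text_features = image_features + 0.1 * rng.normal(size=(128, 4))
score = soft_hgr_objective(image_features, text_features)
```

A training loss would typically minimize the negative of this score. When both inputs are the same whitened feature matrix with `k` columns, the estimate equals `k / 2`, which gives a quick sanity check.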