Gram-Anchored Prompt Learning for Vision-Language Models via Second-Order Statistics

Minglei Chen, Weilong Wang, Jiang Duan, Ye Deng

Abstract

Parameter-efficient prompt learning has become the de facto standard for adapting Vision-Language Models (VLMs) to downstream tasks. Existing approaches predominantly focus on aligning text prompts with first-order visual features (i.e., spatial feature maps). While such features are effective for fine-grained semantic discrimination, we argue that relying solely on first-order information is insufficient for robust adaptation, as spatially entangled features are highly susceptible to domain shifts and local noise. In this work, we propose \textbf{Gram-Anchored Prompt Learning (GAPL)}, a framework that synergizes local semantic alignment with global structural consistency. Methodologically, we introduce an additional second-order statistical stream via \textbf{Gram matrices} that augments the standard first-order spatial interaction. By anchoring prompts to these second-order priors, our approach enables language representations to adapt dynamically to statistical distribution shifts across diverse domains. Extensive experiments demonstrate the effectiveness of the second-order features and the compelling performance of GAPL on various benchmarks.
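
For concreteness, the following is a minimal sketch of the second-order cue the abstract refers to, assuming CLIP-style patch tokens of shape (N, D) from the vision encoder; the function name and shapes are illustrative assumptions, not the paper's implementation. The Gram matrix G = F^T F / N averages over spatial positions, so it discards spatial arrangement and captures channel co-activation statistics.

```python
import torch

def gram_second_order_cue(patch_tokens: torch.Tensor) -> torch.Tensor:
    """Compute a Gram-based second-order cue from patch tokens.

    Args:
        patch_tokens: (B, N, D) patch features from the vision encoder
                      (CLS token excluded).
    Returns:
        (B, D, D) Gram matrices: channel-by-channel co-activation
        statistics that discard spatial arrangement.
    """
    B, N, D = patch_tokens.shape
    # G = F^T F / N: averaging over spatial positions makes the
    # statistic order-invariant and robust to local noise.
    return torch.einsum('bnd,bne->bde', patch_tokens, patch_tokens) / N

# Example with hypothetical CLIP-like shapes: 196 patch tokens, 512-d.
tokens = torch.randn(4, 196, 512)
print(gram_second_order_cue(tokens).shape)  # torch.Size([4, 512, 512])
```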

Figures (6)

  • Figure 1: Illustration of first-order and anchored feature spaces. Under domain shift (e.g., photo vs. sketch), first-order conditioning may produce separated representations for the same semantic concept, whereas Gram-based anchoring introduces a second-order cue that brings them closer in the feature space.
  • Figure 2: Overview of the proposed Gram-Anchored Prompt Learning (GAPL) framework. Learnable prompt tokens are inserted into the text encoder, and only the input-layer prompts are shown for clarity. GAPL comprises three streams: (1) a Global Invariant Stream that aligns the prompted text feature with the global visual feature from the CLS token; (2) a Gram-Anchored Stream (purple), the core component of our method, which extracts a Gram-based second-order cue from patch tokens and uses a Gram-based Style Modulator to generate a Style Text Anchor; and (3) a Contextual-Anchored Stream (green), which uses learnable local signals to produce Contextual Text Anchors for fine-grained alignment. The three streams are optimized jointly for robust prompt adaptation across domains.
  • Figure 3: Detailed architecture of the Gram-based Style Modulator and the Contextual Modulator. Left: the Gram-based Style Modulator receives a Gram-based second-order cue derived from patch tokens, retains only the diagonal of the Gram matrix as a compact image-level descriptor, and converts it into a gating vector for modulating the prompted text feature. Right: the Contextual Modulator uses the prompted text feature as a query to interact with learnable local signals, producing Contextual Text Anchors for fine-grained alignment. (An illustrative code sketch of both modulators follows this list.)
  • Figure 4: Visual specificity comparison (query point on the dog's ear).
  • Figure 5: t-SNE visualization of latent manifolds across four domains for five selected classes. (a) CLS Token (first-order global) exhibits severe domain divergence with scattered distributions. (b) Contextual Text Anchor (first-order local) remains susceptible to stylistic noise despite local patch aggregation. (c) Style Text Anchor (ours, second-order) achieves superior alignment by collapsing domain-specific variance into compact, class-discriminative clusters.
  • ...and 1 more figure
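
To make the dataflow described in Figures 2 and 3 concrete, below is a minimal PyTorch sketch of the two modulators. The module names, layer sizes, the sigmoid gate, the bank of 16 local signals, and the 8 attention heads are all illustrative assumptions under a CLIP-like 512-d feature space, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class GramStyleModulator(nn.Module):
    """Gram-based Style Modulator (Figure 3, left) -- illustrative sketch.

    Retains only the diagonal of the Gram matrix G = F^T F / N as a
    compact image-level descriptor and maps it to a gating vector that
    modulates the prompted text feature into a Style Text Anchor.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, patch_tokens: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D); text_feat: (B, D)
        # Diagonal of G computed directly, without forming the full
        # D x D matrix: G[d, d] = mean_n F[n, d]^2 (per-channel energy).
        diag = patch_tokens.pow(2).mean(dim=1)   # (B, D)
        return self.gate(diag) * text_feat       # Style Text Anchor, (B, D)

class ContextualModulator(nn.Module):
    """Contextual Modulator (Figure 3, right) -- illustrative sketch.

    The prompted text feature acts as a query over a bank of learnable
    local signals via cross-attention.
    """
    def __init__(self, dim: int, num_signals: int = 16):
        super().__init__()
        self.local_signals = nn.Parameter(torch.randn(num_signals, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, text_feat: torch.Tensor) -> torch.Tensor:
        q = text_feat.unsqueeze(1)                                    # (B, 1, D)
        kv = self.local_signals.unsqueeze(0).expand(q.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)                                 # (B, 1, D)
        return out.squeeze(1)                      # Contextual Text Anchor, (B, D)

# Example with hypothetical dimensions.
patch_tokens = torch.randn(4, 196, 512)   # 196 patch tokens, 512-d
text_feat = torch.randn(4, 512)           # prompted text feature
style_anchor = GramStyleModulator(512)(patch_tokens, text_feat)
ctx_anchor = ContextualModulator(512)(text_feat)
print(style_anchor.shape, ctx_anchor.shape)
```

Note that computing only the diagonal reduces the descriptor from D x D to D values, which is consistent with the caption's description of a compact image-level descriptor feeding a D-dimensional gate.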