Understanding Prompt Tuning for V-L Models Through the Lens of Neural Collapse

Didi Zhu, Zexi Li, Min Zhang, Junkun Yuan, Yunfeng Shao, Jiashuo Liu, Kun Kuang, Yinchuan Li, Chao Wu

TL;DR

This work extends the neural collapse perspective from vision-only models to vision-language prompts by analyzing how text and image representations align to a simplex ETF. It introduces Neural-collapse-anchored Prompt Tuning (NPT) with two regularizers—language-modality collapse and multi-modality isomorphism—to enforce ETF-like structure under class imbalance. Theoretical gradient insights and extensive experiments show that NC-informed regularization improves base-to-novel generalization, cross-dataset transfer, and domain generalization across 11 datasets, while remaining compatible with existing prompt-tuning methods. The results highlight ETF structure as a diagnostic and design principle for robust V-L model generalization under imbalance.

Abstract

Large-scale vision-language (V-L) models have demonstrated remarkable generalization capabilities for downstream tasks through prompt tuning. However, the mechanisms behind the learned text representations are unknown, limiting further generalization gains, especially under class imbalance scenarios. Recent advances in the neural collapse (NC) phenomenon of vision-only models suggest that the optimal representation structure is the simplex ETF, which paves the way to study representations in V-L models. In this paper, we make the first attempt to use NC for examining the representations in V-L models via prompt tuning. We find that the NC optimality of text-to-image representations correlates positively with downstream generalizability, and that this dependence is more pronounced under class imbalance. To improve the representations, we propose Neural-collapse-anchored Prompt Tuning (NPT), a novel method that learns prompts with text and image representations that satisfy the same simplex ETF. NPT incorporates two regularization terms, language-modality collapse and multi-modality isomorphism, and it is compatible with other prompt tuning methods. Extensive experiments show that NPT consistently improves existing prompt tuning techniques across 11 datasets in both balanced and imbalanced settings.

Paper Structure (16 sections, 13 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Neural collapse degree visualization in CLIP on the EuroSAT dataset. Arrows indicate text representations, points denote image representations, and colors indicate categories. (a-b) Under balanced conditions, the SOTA MaPLe method amplifies CLIP's generalizability by intensifying the neural collapse degree. (c-d) Under imbalance, the optimal structure in MaPLe is disrupted, resulting in a performance drop. Our NPT method can refine this structure, improving generalization.
  • Figure 2: Comparison of $\Delta_\text{MID}$ and $\Delta_\text{LCD}$ with generalization performance under the base-to-novel task. $\tau$ denotes the imbalance degree: $\tau = 1$ for balanced data and $\tau = 0.01$ for high imbalance. (a) In balanced settings, we compare Zero-shot CLIP with current soft prompt tuning methods and find that smaller $\Delta_\text{MID}$ and $\Delta_\text{LCD}$ errors (smaller values indicate a greater degree of neural collapse) go with higher accuracy. (b) This correlation also holds in class imbalance scenarios, where both the degree of neural collapse and generalization performance are more impaired, even for the SOTA method MaPLe. Our approach bolsters CLIP's generalization in class imbalance settings by minimizing $\Delta_\text{LCD}$ and $\Delta_\text{MID}$, thereby avoiding significant performance declines.
  • Figure 3: Overview of Neural-collapse-anchored Prompt Tuning (NPT). Our method combines two regularizers: the LC Regularizer $\mathcal{L}_\text{LC}$, which controls the increase in $\Delta_\text{LCD}$ values and fosters more discriminative textual representations; and the MI Regularizer $\mathcal{L}_\text{MI}$, which promotes stronger multi-modal alignment to attain a reduced $\Delta_\text{MID}$.
  • Figure 4: Absolute improvement of NPT over MaPLe in the base-to-novel generalization task. Compared with MaPLe, our method achieves improvement for both base and new classes on all datasets with $\tau = 0.05$ and $\tau = 0.01$.
  • Figure 5: Sensitivity analysis of $w_1$ and $w_2$ under base-to-novel task with $\tau = 0.01$.
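
The captions above give the imbalance degree $\tau$ only through its endpoints ($\tau = 1$ balanced, $\tau = 0.01$ highly imbalanced). As a rough sketch of how such a setting might be simulated, the snippet below assumes $\tau$ is the ratio of the smallest to the largest class size and that class sizes decay exponentially in between, a common long-tailed convention. Both assumptions, the function name `long_tailed_counts`, and the sample sizes are illustrative and not taken from the paper.

```python
import numpy as np

def long_tailed_counts(n_max: int, num_classes: int, tau: float) -> np.ndarray:
    """Per-class sample counts decaying from n_max down to about tau * n_max.

    Assumption (not from the paper): tau = smallest class size / largest class
    size, with an exponential profile across classes.
    """
    if not 0 < tau <= 1:
        raise ValueError("tau must lie in (0, 1]")
    if num_classes == 1:
        return np.array([n_max])
    k = np.arange(num_classes)
    counts = n_max * tau ** (k / (num_classes - 1))
    # Round to whole samples and keep at least one sample per class.
    return np.maximum(np.rint(counts).astype(int), 1)

balanced = long_tailed_counts(n_max=100, num_classes=10, tau=1.0)
imbalanced = long_tailed_counts(n_max=1000, num_classes=10, tau=0.01)
print(balanced)
print(imbalanced)
```

With $\tau = 1$ every class keeps the same number of samples; lowering $\tau$ shrinks the tail classes while the head class stays at `n_max`.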

Theorems & Definitions (6)

  • Definition 1: Simplex Equiangular Tight Frame
  • Definition 2: Language-modality Collapse Degree
  • Definition 3: Multi-modality Isomorphism Degree
  • Remark 1
  • Remark 2
  • Remark 3
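
Definition 1 is listed here by title only. In the neural collapse literature, a simplex ETF for $K$ classes is usually written as $M = \sqrt{K/(K-1)}\, U (I_K - \tfrac{1}{K} \mathbf{1}\mathbf{1}^\top)$ with $U^\top U = I_K$, so that all columns have unit norm and every pair has cosine $-1/(K-1)$. The sketch below builds such a frame and checks it numerically, assuming this standard form is the one intended. The helper `etf_gram_gap` is a hypothetical illustration of how far a set of class representations sits from that structure; it is not the paper's $\Delta_\text{LCD}$ or $\Delta_\text{MID}$, whose formulas are not given in this excerpt.

```python
import numpy as np

def simplex_etf(num_classes: int, dim: int, seed: int = 0) -> np.ndarray:
    """Return a (dim, num_classes) matrix whose columns form a simplex ETF."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    if dim < num_classes:
        raise ValueError("this construction needs dim >= num_classes")
    rng = np.random.default_rng(seed)
    # Reduced QR gives U with orthonormal columns, i.e. U.T @ U = I.
    u, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))
    k = num_classes
    centering = np.eye(k) - np.ones((k, k)) / k
    return np.sqrt(k / (k - 1)) * u @ centering

def etf_gram_gap(features: np.ndarray) -> float:
    """Hypothetical metric: Frobenius distance between the cosine Gram matrix
    of the columns of `features` and the ideal simplex-ETF Gram matrix."""
    k = features.shape[1]
    unit = features / np.linalg.norm(features, axis=0, keepdims=True)
    # Ideal Gram matrix: 1 on the diagonal, -1/(k-1) everywhere else.
    ideal = (k / (k - 1)) * np.eye(k) - np.ones((k, k)) / (k - 1)
    return float(np.linalg.norm(unit.T @ unit - ideal))

etf = simplex_etf(num_classes=10, dim=512)
random_feats = np.random.default_rng(1).standard_normal((512, 10))
print(etf_gram_gap(etf))           # close to zero for an exact ETF
print(etf_gram_gap(random_feats))  # clearly larger for unstructured features
```

A gap near zero means the class representations are equiangular and maximally separated, which is the structure the TL;DR and abstract describe as the target for both text and image representations.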