Craft: Cross-modal Aligned Features Improve Robustness of Prompt Tuning

Jingchen Sun, Rohan Sharma, Vishnu Suresh Lokhande, Changyou Chen

TL;DR

This paper tackles prompt-tuning overfitting in vision–language models under limited data and distribution shift by introducing Craft, a cross-modal feature alignment framework that uses static and stochastic anchors drawn from the opposite modality to regularize prompts via an alignment loss $\mathcal{L}_{\text{Aligned}}$ and an anchor-aligned MMD loss $\mathcal{L}_{\text{MMD}}$. Anchors stabilize the latent space across text and image modalities, creating a unified cross-modal representation; the induced anchor measure $\mathbb{P}_x^{a_y}$ enables feasible MMD computation in the anchor space with a Gaussian kernel. Empirically, Craft improves Base-to-Novel generalization, reduces group robustness gaps, and enhances out-of-distribution recognition across 11 datasets and four prompt-tuning structures, with gains up to 6.1, 5.8, and 2.7 percentage points respectively. Ablation studies corroborate the contributions of static/stochastic anchors and MMD, while visualizations show clearer, more discriminative latent spaces. The approach offers a practical, plug-in regularization for robust visual-language prompt tuning with broad implications for transfer and OOD performance.

Abstract

Prompt Tuning has emerged as a prominent research paradigm for adapting vision-language models to various downstream tasks. However, recent research indicates that prompt tuning methods often lead to overfitting due to limited training samples. In this paper, we propose a Cross-modal Aligned Feature Tuning (Craft) method to address this issue. Cross-modal alignment is conducted by first selecting anchors from the alternative domain and deriving relative representations of the embeddings for the selected anchors. Optimizing for a feature alignment loss over anchor-aligned text and image modalities creates a more unified text-image common space. Overfitting in prompt tuning also deteriorates model performance on out-of-distribution samples. To further improve the prompt model's robustness, we propose minimizing Maximum Mean Discrepancy (MMD) over the anchor-aligned feature spaces to mitigate domain shift. The experiment on four different prompt tuning structures consistently shows the improvement of our method, with increases of up to $6.1\%$ in the Base-to-Novel generalization task, $5.8\%$ in the group robustness task, and $2.7\%$ in the out-of-distribution tasks. The code will be available at https://github.com/Jingchensun/Craft
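
To make the two ingredients named in the abstract concrete, here is a minimal NumPy sketch of (a) anchor-relative representations and (b) a Gaussian-kernel MMD computed in the anchor space. It is an illustration under stated assumptions, not the authors' implementation: the use of cosine similarity for the relative representation, the biased MMD estimator, the bandwidth `sigma`, and all function names are hypothetical choices made for this example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def relative_representation(features, anchors):
    """Describe each feature by its cosine similarity to every anchor.

    features: (n, d) embeddings from one modality (e.g. images).
    anchors:  (k, d) embeddings from the other modality (e.g. text).
    Returns an (n, k) matrix. Cosine similarity is an assumption here.
    """
    return l2_normalize(features) @ l2_normalize(anchors).T

def gaussian_kernel(a, b, sigma=1.0):
    """Pairwise Gaussian kernel between the rows of a and the rows of b."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased empirical estimate of the squared MMD between samples x and y."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())

# Toy data standing in for encoder outputs (purely synthetic).
rng = np.random.default_rng(0)
text_anchors = rng.normal(size=(5, 16))       # k = 5 anchors, d = 16
id_images = rng.normal(size=(32, 16))         # "in-distribution" batch
ood_images = rng.normal(size=(32, 16)) + 3.0  # shifted batch

id_rel = relative_representation(id_images, text_anchors)
ood_rel = relative_representation(ood_images, text_anchors)
print("MMD^2 in anchor space:", mmd2(id_rel, ood_rel))
```

Working in the anchor space means the kernel operates on `k`-dimensional similarity vectors rather than on the raw `d`-dimensional embeddings; in a training loop this quantity would be added to the task loss as a regularizer, with a weight that this sketch does not attempt to specify.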

Paper Structure (30 sections, 2 theorems, 7 equations, 7 figures, 10 tables)

Key Result

Lemma 2

$(\Omega_x, \mathcal{F}_x)$ is a measurable space, and $\mathbb{P}^{\text{id}}_x$ and $\mathbb{P}^{\text{ood}}_x$ are two Borel probability measures for in-domain and out-of-domain imaging data. Then $\mathbb{P}^{\text{id}}_x = \mathbb{P}^{\text{ood}}_x$ if and only if $\mathbb{E}_{x_\text{id}}(f(x_

Figures (7)

  • Figure 1: Illustration of our proposed cross-modal feature alignment method. First, k-means clustering is applied to the image embeddings to obtain static image anchors. Static text anchors are derived from the class-text labels. Simultaneously, we construct batch-level image and text samples to create stochastic image and text anchors. Static image anchors are aligned with stochastic text anchors using Equation \ref{align-image}. Additionally, stochastic text anchors and stochastic image anchors are aligned with each other by Equation \ref{l-aligned}. To address out-of-distribution samples, we apply the Maximum Mean Discrepancy (MMD) method to the aligned features, ensuring consistency within the latent space.
  • Figure 2: The ablation study on individual datasets. $\mathcal{L}_{\text{Baseline}}$ refers to the use of text-based cross-entropy loss in the method. Figures (a) and (b) demonstrate that adding $\mathcal{L}_{\text{Aligned (Static)}}$ or $\mathcal{L}_{\text{Aligned (Stochastic)}}$ can complementarily improve accuracy in in-distribution tasks. Figures (c) and (d) show that adding $\mathcal{L}_{\text{MMD}}$ further enhances accuracy across out-of-distribution tasks.
  • Figure 3: The effectiveness analysis on channel importance ratio distribution. The Oracle model is trained on the combination of labeled source and labeled target data, while our model is trained on the labeled source and unlabeled target data. Our $\mathcal{L}_{\text{MMD}}$ mitigates the domain shift of the baseline method and pushes the channel importance distribution closer to that of the Oracle model.
  • Figure 4: The t-SNE visualization of latent embeddings. The arrows in the three sub-figures illustrate that our method pushes the boundary between the two categories further apart. The circles in Figures (a) and (b) show that our method separates the overlapping features of the two categories. The circle in Figure (c) shows that our method achieves a more compact feature space.
  • Figure 5: Comparison of Prediction Probabilities With and Without Our Method. Our robust prompt tuning method effectively corrects misclassifications made by the baseline method.
  • ...and 2 more figures
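
The anchor construction described in the Figure 1 caption (k-means over image embeddings for static anchors, batch-level samples for stochastic anchors) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the plain Lloyd iterations, the uniform batch sampling, and the names `kmeans_anchors` and `stochastic_anchors` are assumptions made for this example.

```python
import numpy as np

def kmeans_anchors(embeddings, k, n_iters=20, seed=0):
    """Static anchors as k-means centroids of the embeddings (Lloyd's algorithm).

    embeddings: (n, d) array. Returns a (k, d) array of centroids.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct samples.
    init_idx = rng.choice(len(embeddings), size=k, replace=False)
    centroids = embeddings[init_idx].copy()
    for _ in range(n_iters):
        # Squared distance from every sample to every centroid: (n, k).
        diff = embeddings[:, None, :] - centroids[None, :, :]
        labels = (diff ** 2).sum(axis=-1).argmin(axis=1)
        for j in range(k):
            members = embeddings[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def stochastic_anchors(embeddings, batch_size, rng):
    """Stochastic anchors: a fresh random batch of embeddings (assumed uniform)."""
    idx = rng.choice(len(embeddings), size=batch_size, replace=False)
    return embeddings[idx]

# Synthetic stand-in for image-encoder outputs.
rng = np.random.default_rng(1)
image_embeddings = rng.normal(size=(200, 16))

static = kmeans_anchors(image_embeddings, k=10)          # fixed across training
stochastic = stochastic_anchors(image_embeddings, 8, rng)  # resampled per batch
print(static.shape, stochastic.shape)
```

Static anchors are computed once and give a stable reference frame, while stochastic anchors change with every batch; the caption indicates that both kinds are used in the alignment losses.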

Theorems & Definitions (4)

  • Remark 1
  • Lemma 2
  • Definition 4
  • Lemma 5