Craft: Cross-modal Aligned Features Improve Robustness of Prompt Tuning
Jingchen Sun, Rohan Sharma, Vishnu Suresh Lokhande, Changyou Chen
TL;DR
This paper tackles prompt-tuning overfitting in vision-language models under limited data and distribution shift. It introduces Craft, a cross-modal feature alignment framework that regularizes prompts with static and stochastic anchors drawn from the opposite modality, using an alignment loss $\mathcal{L}_{\text{Aligned}}$ and an anchor-aligned MMD loss $\mathcal{L}_{\text{MMD}}$. Anchors stabilize the latent space across the text and image modalities, creating a unified cross-modal representation; the induced anchor measure $\mathbb{P}_x^{a_y}$ makes MMD computation feasible in the anchor space with a Gaussian kernel. Empirically, Craft improves Base-to-Novel generalization, reduces group robustness gaps, and enhances out-of-distribution recognition across 11 datasets and four prompt-tuning structures, with gains of up to 6.1, 5.8, and 2.7 percentage points, respectively. Ablation studies corroborate the contributions of the static and stochastic anchors and of the MMD term, while visualizations show clearer, more discriminative latent spaces. The approach offers a practical, plug-in regularizer for robust vision-language prompt tuning, with broad implications for transfer and OOD performance.
Abstract
Prompt tuning has emerged as a prominent research paradigm for adapting vision-language models to various downstream tasks. However, recent research indicates that prompt tuning methods often overfit because of limited training samples. In this paper, we propose a Cross-modal Aligned Feature Tuning (Craft) method to address this issue. Cross-modal alignment is conducted by first selecting anchors from the alternative modality and then deriving relative representations of the embeddings with respect to the selected anchors. Optimizing a feature alignment loss over the anchor-aligned text and image modalities creates a more unified text-image common space. Overfitting in prompt tuning also degrades model performance on out-of-distribution samples. To further improve the prompt model's robustness, we propose minimizing the Maximum Mean Discrepancy (MMD) over the anchor-aligned feature spaces to mitigate domain shift. Experiments on four different prompt tuning structures show consistent improvements from our method, with increases of up to $6.1\%$ in the Base-to-Novel generalization task, $5.8\%$ in the group robustness task, and $2.7\%$ in the out-of-distribution tasks. The code will be available at https://github.com/Jingchensun/Craft
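As a rough illustration of the two ingredients named in the abstract, the sketch below computes anchor-relative representations and a Gaussian-kernel MMD estimate between two feature sets. It is not the authors' implementation: the use of cosine similarity to define the relative representation, the biased (V-statistic) MMD estimator, the bandwidth `sigma`, and all function names and toy vectors are assumptions made for this example.

```python
import math

def cosine(u, v):
    # Cosine similarity between two nonzero vectors.
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def relative_representation(features, anchors):
    # Assumption: each feature is re-expressed as its cosine
    # similarities to anchors taken from the other modality.
    return [[cosine(f, a) for a in anchors] for f in features]

def gaussian_kernel(u, v, sigma=1.0):
    # k(u, v) = exp(-||u - v||^2 / (2 * sigma^2))
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    # Biased estimate of squared MMD:
    # mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    def mean_k(a, b):
        total = sum(gaussian_kernel(u, v, sigma) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

# Hypothetical toy data: two text-side anchors and two image feature sets.
text_anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_feats_a = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
image_feats_b = [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]]

rel_a = relative_representation(image_feats_a, text_anchors)
rel_b = relative_representation(image_feats_b, text_anchors)

print("MMD^2, same set:     ", mmd2(rel_a, rel_a))
print("MMD^2, shifted sets: ", mmd2(rel_a, rel_b))
```

In this toy setting the estimate is zero when a set is compared with itself and positive when the two sets lie near different anchors, which is the behavior an MMD regularizer in the anchor space would penalize.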
