Robust Prompt Tuning for Vision-Language Models with Mild Semantic Noise

Yansheng Gao; Yufei Zheng; Shengsheng Wang

TL;DR

ANPrompt addresses the brittleness of prompt tuning in open-vocabulary vision-language tasks by actively injecting weak semantic perturbations into both the text prompts and the visual pathway. The framework combines four components: Weak Noise Frozen Text Features, Anti-noise Prompt construction, a Noise-Resistant Visual Prompt Prototype (NRVPP), and a variance-adaptive Weak Alignment Loss ($\mathcal{L}_{WA}$) that stabilizes logits under perturbation. Empirical results across 11 datasets, including base-to-new splits and cross-domain scenarios, show greater robustness to semantic noise and better generalization than strong-noise and noise-filtering baselines. The results indicate that controlled exposure to semantic variation, rather than aggressive noise suppression, yields more reliable and transferable open-vocabulary recognition under prompt tuning.

Abstract

Prompt tuning has shown promising results, but its robustness and generalization to unseen categories remain limited. Through our experiments, we demonstrate that the complete removal of semantic noise is a key factor restricting robustness. Existing methods typically suppress or filter out semantic noise in the prompt space, inadvertently hindering the model's robustness and its ability to generalize to unseen categories. To address this, we propose ANPrompt, a robust prompt tuning framework that actively incorporates weak semantic noise. By clustering weakly perturbed features into noise prompts and integrating them with learnable tokens in both the text and vision encoders, ANPrompt ensures controlled exposure to semantic variations. To enhance the visual pathway, we introduce the Noise-Resistant Visual Prompt Prototype (NRVPP), which stabilizes visual semantics under weak perturbations. Additionally, we propose a Weak Alignment Loss (WALoss) at the logits level to enforce consistency between clean and perturbed predictions, providing stable supervision. By combining weak semantic noise exposure with logits-based consistency, ANPrompt prevents overfitting to specific phrasings while preserving semantic integrity. Extensive experiments across 11 benchmarks, including base-to-new splits, show that ANPrompt consistently outperforms existing prompt tuning methods, offering superior robustness to semantic noise and improved generalization across tasks.
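The two mechanisms named in the abstract, clustering weakly perturbed features into noise prompts and enforcing logit-level consistency between clean and perturbed predictions, can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the Gaussian noise model and its scale `sigma`, the use of plain k-means, the mean-squared form of the consistency term (the variance-adaptive weighting is omitted because its form is not given here), and all function names.

```python
import numpy as np

def weak_perturb(features, sigma=0.05, n_views=4, seed=0):
    """Add small Gaussian noise to features, then re-normalise (assumed noise model)."""
    rng = np.random.default_rng(seed)
    noise = sigma * rng.standard_normal((n_views,) + features.shape)
    noisy = (features[None, :, :] + noise).reshape(-1, features.shape[1])
    return noisy / np.linalg.norm(noisy, axis=1, keepdims=True)

def kmeans_centroids(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; the centroids stand in for 'noise prompts' (assumption)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def consistency_loss(clean_logits, noisy_logits):
    """Mean squared gap between clean and perturbed logits (assumed form)."""
    return float(np.mean((clean_logits - noisy_logits) ** 2))

# Toy data: 10 class text features and 5 image features of dimension 16.
rng = np.random.default_rng(0)
text_feats = rng.standard_normal((10, 16))
text_feats /= np.linalg.norm(text_feats, axis=1, keepdims=True)
image_feats = rng.standard_normal((5, 16))

noisy_feats = weak_perturb(text_feats)               # 4 views x 10 classes
noise_prompts = kmeans_centroids(noisy_feats, k=3)   # 3 cluster centres

clean_logits = image_feats @ text_feats.T
noisy_logits = image_feats @ noisy_feats[:10].T      # first perturbed view
loss = consistency_loss(clean_logits, noisy_logits)
print(noise_prompts.shape, loss)
```

In this sketch the consistency term is zero when the perturbed logits match the clean ones and grows with the gap, which is the sense in which a logit-level loss can supply stable supervision under weak noise.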
