Robust Prompt Tuning for Vision-Language Models with Mild Semantic Noise

Yansheng Gao, Yufei Zheng, Shengsheng Wang

TL;DR

ANPrompt addresses brittleness in prompt tuning for open-vocabulary vision-language tasks by actively injecting weak semantic perturbations into both text prompts and the visual pathway. The framework combines Weak Noise Frozen Text Features, Anti-noise Prompt construction, a Noise-Resistant Visual Prompt Prototype (NRVPP), and a variance-adaptive Weak Alignment Loss ($\mathcal{L}_{WA}$) to stabilize logits under perturbations. Empirical results across 11 datasets, including base-to-new splits and cross-domain scenarios, show superior robustness to semantic noise and improved generalization over strong-noise and noise-filtering baselines. This approach demonstrates that controlled exposure to semantic variation, rather than aggressive noise suppression, yields more reliable and transferable open-vocabulary recognition with prompt tuning.

Abstract

Prompt tuning has shown promising results, but its robustness and generalization to unseen categories remain limited. Through our experiments, we demonstrate that the complete removal of semantic noise is a key factor restricting robustness. Existing methods typically suppress or filter out semantic noise in the prompt space, inadvertently hindering the model's robustness and its ability to generalize to unseen categories. To address this, we propose ANPrompt, a robust prompt tuning framework that actively incorporates weak semantic noise. By clustering weakly perturbed features into noise prompts and integrating them with learnable tokens in both the text and vision encoders, ANPrompt ensures controlled exposure to semantic variations. To enhance the visual pathway, we introduce the Noise-Resistant Visual Prompt Prototype (NRVPP), which stabilizes visual semantics under weak perturbations. Additionally, we propose a Weak Alignment Loss (WALoss) at the logits level to enforce consistency between clean and perturbed predictions, providing stable supervision. By combining weak semantic noise exposure with logits-based consistency, ANPrompt prevents overfitting to specific phrasings while preserving semantic integrity. Extensive experiments across 11 benchmarks, including base-to-new splits, show that ANPrompt consistently outperforms existing prompt tuning methods, offering superior robustness to semantic noise and improved generalization across tasks.
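
To make the idea of logits-level consistency between clean and perturbed predictions concrete, the following is a minimal sketch, not the paper's WALoss. It assumes a plain KL divergence between the softmax distributions of the clean and weakly perturbed logits; the function names `log_softmax` and `consistency_loss` are hypothetical, and the variance-adaptive weighting mentioned in the TL;DR is omitted because its form is not given here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of class logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def consistency_loss(clean_logits, noisy_logits):
    """KL(p_clean || p_noisy): an assumed stand-in for a logits-level
    alignment term. It is zero when both logit vectors induce the same
    class distribution and grows as the predictions diverge."""
    log_p = log_softmax(clean_logits)
    log_q = log_softmax(noisy_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Identical predictions incur no penalty; a reversed ranking is penalized.
same = consistency_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
flipped = consistency_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

In a training loop, a term of this kind would be added to the usual cross-entropy loss so that weak perturbations of the prompt do not change the predicted class distribution.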

Paper Structure

This paper contains 32 sections, 9 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: This figure compares the accuracy of various methods, including DAPT, Synreplace, Mask, Shuffle, and Drop. The results show that incorporating weak semantic noise achieves superior performance, outperforming DAPT (79.35%) and other baseline methods. This highlights the effectiveness of weak semantic noise injection in enhancing model robustness and generalization.
  • Figure 2: Framework comparison with representative filtering-based methods such as DAPT and ArGue. While these approaches constrain frozen logits and risk overfitting under weak semantic perturbations, the proposed ANPrompt leverages weak semantic perturbations to generate soft logits and anti-noise prompts, thereby improving generalization.
  • Figure 3: Illustration of weak semantic perturbations in both image and text. Class activation maps (CAMs) may highlight irrelevant regions such as background or co-occurring objects, while subtle textual cues (e.g., "dark stripes") can shift class semantics from "cat" to "tiger". These examples demonstrate how weak semantic noise can mislead recognition and motivate the need for robust prompt learning.
  • Figure 4: Quantitative comparison of different noise strategies in terms of Text Shift (TS), Logit Preservation Rate (LPR), and Accuracy Shift (AS). The results show that strong perturbations (drop, mask, shuffle, synonym replacement) cause semantic distortion and performance degradation, whereas our weak semantic perturbations (WSP) maintain negligible shift, high prediction consistency, and nearly zero accuracy loss.
  • Figure 5: Overview of ANPrompt. For each class, a main and a noise sentence are sampled to construct the Weak Noise Frozen Text Feature, which is clustered into noise prompts. These are combined with learnable tokens to generate Anti-noise Prompts, which are injected into the last two layers of both vision and text encoders. During encoding, we obtain Prompted Image/Text Features and the Noise-Resistant Visual Prompt Prototype (NRVPP) to compute four types of logits, supervised by WALoss and auxiliary losses to improve robustness against weak semantic perturbations.
  • ...and 2 more figures
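
The Figure 5 caption says that weakly perturbed text features are clustered into noise prompts. As a rough illustration of that step only, the sketch below runs a small k-means over feature vectors and treats the resulting centroids as the "noise prompts". The choice of k-means, the deterministic initialization from the first `k` points, and the name `cluster_noise_prompts` are all assumptions; the caption does not specify the clustering algorithm.

```python
def cluster_noise_prompts(features, k, iters=20):
    """Cluster feature vectors with plain k-means and return k centroids.

    Assumption: centroids are initialized from the first k features,
    which keeps the sketch deterministic. Empty clusters keep their
    previous centroid.
    """
    centroids = [list(f) for f in features[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for f in features:
            # Squared Euclidean distance to every centroid.
            dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(f)
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Toy 2-D "features" forming two well-separated groups.
toy_features = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
noise_prompts = cluster_noise_prompts(toy_features, k=2)
```

In the described framework, vectors obtained this way would then be combined with learnable tokens to form the Anti-noise Prompts; that combination is not sketched here.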