Noise-aware few-shot learning through bi-directional multi-view prompt alignment

Lu Niu, Cheng Xue

Abstract

Vision-language models offer strong few-shot capability through prompt tuning but remain vulnerable to noisy labels, which can corrupt prompts and degrade cross-modal alignment. Existing approaches struggle because they often lack the ability to model fine-grained semantic cues and to adaptively separate clean from noisy signals. To address these challenges, we propose NA-MVP, a framework for Noise-Aware few-shot learning through bi-directional Multi-View Prompt alignment. NA-MVP is built upon a key conceptual shift: robust prompt learning requires moving from global matching to region-aware alignment that explicitly distinguishes clean cues from noisy ones. To realize this, NA-MVP employs (1) multi-view prompts combined with unbalanced optimal transport to achieve fine-grained patch-to-prompt correspondence while suppressing unreliable regions; (2) a bi-directional prompt design that captures complementary clean-oriented and noise-aware cues, enabling the model to focus on stable semantics; and (3) an alignment-guided selective refinement strategy that uses optimal transport to correct only mislabeled samples while retaining reliable data. Experiments on synthetic and real-world noisy benchmarks demonstrate that NA-MVP consistently outperforms state-of-the-art baselines, confirming its effectiveness in enabling robust few-shot learning under noisy supervision.

Paper Structure (29 sections, 17 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Limitations of existing prompt learning approaches under noisy labels. Single-view reliance: Limited prompts miss diverse visual patterns. Explicit negatives: Fixed negatives impose rigid supervision. Fixed threshold: Coarse denoising lets noise propagate.
  • Figure 2: Overview of the NA-MVP framework. Our framework consists of two key modules: (1) Noise-aware alignment (blue arrows): Multiple clean and noise-aware prompts per class are encoded and aligned with local image patches via UOT to generate clean/noisy probabilities. (2) Selective label refinement (green arrows): An adaptive threshold $\phi$ derived from these probabilities identifies mislabeled samples, which are refined via classical OT by aligning global image features with clean text features. The two modules work together to iteratively update the training set while optimizing the prompts, producing a denoised dataset for robust prediction under noisy supervision.
  • Figure 3: Test accuracy under varying label noise rates using different numbers of multi-view prompts $N \in \{1, 2, 4, 8\}$.
  • Figure 4: Visualization of bi-directional multi-view prompts. (a) The image; (b) The learned multi-view clean prompts; (c) The learned multi-view noisy prompts.
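
The patch-to-prompt alignment via unbalanced optimal transport (UOT) described in Figure 2 can be illustrated with a minimal NumPy sketch. This is a generic entropic UOT solver with KL-relaxed marginals, not the authors' implementation: the function names (`unbalanced_sinkhorn`, `cosine_cost`), the cosine cost, the uniform marginals, the parameters `eps` and `rho`, and the random features standing in for encoder outputs are all assumptions made for illustration.

```python
import numpy as np

def cosine_cost(patches, prompts):
    """Cost matrix 1 - cosine similarity between patch and prompt features."""
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    t = prompts / np.linalg.norm(prompts, axis=1, keepdims=True)
    return 1.0 - p @ t.T

def unbalanced_sinkhorn(cost, a, b, eps=0.1, rho=1.0, n_iters=200):
    """Entropic UOT with KL penalties on both marginals (generic sketch).

    eps: entropic regularization strength (assumed value).
    rho: marginal relaxation strength; large rho approaches balanced OT,
         small rho lets the plan drop mass from poorly matching patches.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = rho / (rho + eps)        # exponent < 1 softens the marginal constraints
    for _ in range(n_iters):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]   # transport plan, shape (patches, prompts)

# Hypothetical inputs: 49 local patches and 4 multi-view prompts for one class.
rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 16))
prompts = rng.normal(size=(4, 16))

cost = cosine_cost(patches, prompts)
a = np.full(49, 1.0 / 49)            # uniform mass over patches (assumption)
b = np.full(4, 1.0 / 4)              # uniform mass over prompts (assumption)

plan = unbalanced_sinkhorn(cost, a, b)
# Transport-weighted similarity, usable as a class score for this image.
score = float((plan * (1.0 - cost)).sum())
```

In a setup like the one in Figure 2, such a score would be computed once against the clean prompts and once against the noise-aware prompts of each class, and the two would then be normalized into clean/noisy probabilities. How the paper performs that normalization is not stated in this excerpt.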