CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment

Maoyuan Shao, Yutong Gao, Xinyang Huang, Chuang Zhu, Lijuan Sun, Guoshun Nan

TL;DR

A Confusion Bank is constructed to explicitly model stable confusion relationships across categories and misclassified samples, and a Multi-Granularity Difference Expert module is designed to jointly leverage semantic- and sample-level experts for more robust confusion-aware reasoning.

Abstract

Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categories. We observe that such confusion patterns are not random but persistently occur between specific category pairs, revealing the model's intrinsic bias and limited fine-grained discriminative ability. To address this, we propose CAPT, a Confusion-Aware Prompt Tuning framework that enables models to learn from their own misalignment. Specifically, we construct a Confusion Bank to explicitly model stable confusion relationships across categories and misclassified samples. On this basis, we introduce a Semantic Confusion Miner (SEM) to capture global inter-class confusion through semantic difference and commonality prompts, and a Sample Confusion Miner (SAM) to retrieve representative misclassified instances from the bank and capture sample-level cues through a Diff-Manner Adapter that integrates global and local contexts. To further unify confusion information across different granularities, a Multi-Granularity Difference Expert (MGDE) module is designed to jointly leverage semantic- and sample-level experts for more robust confusion-aware reasoning. Extensive experiments on 11 benchmark datasets demonstrate that our method significantly reduces confusion-induced errors while enhancing the discriminability and generalization of both base and novel classes, successfully resolving 50.72 percent of confusable sample pairs. Code will be released at https://github.com/greatest-gourmet/CAPT.
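The Confusion Bank idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the class name `ConfusionBank`, its methods, and the use of raw pair counts to rank confusion relationships are assumptions made for illustration. The sketch records each misclassification as a (true class, predicted class) pair, counts how often each pair occurs, and keeps the ids of the misclassified samples so they can be retrieved later.

```python
from collections import Counter, defaultdict


class ConfusionBank:
    """Hypothetical bank of confusion statistics (illustrative, not the paper's code)."""

    def __init__(self):
        # Number of times true class t was predicted as class p (t != p).
        self.pair_counts = Counter()
        # Ids of the misclassified samples, grouped by (true, predicted) pair.
        self.samples = defaultdict(list)

    def update(self, sample_ids, true_labels, pred_labels):
        """Record every misclassified sample from one batch of predictions."""
        for sid, t, p in zip(sample_ids, true_labels, pred_labels):
            if t != p:
                self.pair_counts[(t, p)] += 1
                self.samples[(t, p)].append(sid)

    def top_pairs(self, k=1, min_count=1):
        """Return the k most frequent confusion pairs seen at least min_count times."""
        return [(pair, n) for pair, n in self.pair_counts.most_common(k) if n >= min_count]

    def retrieve(self, true_label, pred_label, limit=None):
        """Return stored sample ids for one confusion pair (a copy, optionally truncated)."""
        ids = self.samples.get((true_label, pred_label), [])
        return ids[:limit] if limit is not None else list(ids)


# Toy usage with made-up predictions.
bank = ConfusionBank()
bank.update(
    sample_ids=[0, 1, 2, 3, 4],
    true_labels=["terrier", "terrier", "terrier", "bulldog", "beagle"],
    pred_labels=["bulldog", "bulldog", "terrier", "bulldog", "terrier"],
)
print(bank.top_pairs(1))
print(bank.retrieve("terrier", "bulldog"))
```

In this toy run the dominant pair is terrier predicted as bulldog. How the paper decides that a confusion relationship is "stable", and which stored samples count as "representative", is not specified in the abstract, so the frequency ranking here is only a stand-in.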

Paper Structure

This paper contains 20 sections, 13 equations, 11 figures, and 12 tables.

Figures (11)

  • Figure 1: Heatmap of model misclassifications shows certain categories are consistently and frequently mispredicted as specific others, e.g., in OxfordPets, terrier is misclassified as bulldog 30 times while rarely being mistaken for other classes. CAPT significantly reduces such confusion rates, thereby improving overall accuracy.
  • Figure 2: Overview of CAPT. By matching feature representations, we first employ a Semantic Confusion Miner (SEM) that, together with statistics from the Confusion Bank, identifies Semantic Confusion Pairs and generates both commonality and difference prompts. Subsequently, the Sample Confusion Miner (SAM) locates the most representative confusing samples based on these pairs and extracts their Sample Confusion Feature via the Diff-Manner Adapter. Finally, the Multi-Granularity Discrepancy Expert (MGDE) module integrates semantic- and sample-level confusion information for unified representation refinement. The framework of MGDE is shown in Figure 3.
  • Figure 3: Overview of the Multi-Granularity Discrepancy Expert. MGDE integrates semantic- and sample-level experts to fuse dual confusion cues, reinforced by random vectors and specific initialization to enhance confusion learning in prompt tuning.
  • Figure 4: Correction rate of misclassified samples stored in the Confusion Bank.
  • Figure 5: Grad-CAM visualizations of the effect of the Diff-Manner Adapter when using only the global part, only the local part, and the whole adapter.
  • ...and 6 more figures
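
A correction rate like the one in Figure 4 can be computed along the following lines. This is a minimal sketch under an assumption: the page does not give the paper's exact definition, so the rate is taken here to be the fraction of samples stored in the Confusion Bank (all originally misclassified) that the tuned model then classifies correctly. The function name `correction_rate` and the toy labels are hypothetical.

```python
def correction_rate(bank_true_labels, new_pred_labels):
    """Fraction of previously misclassified samples that are now predicted correctly.

    bank_true_labels: ground-truth labels of the samples stored in the bank.
    new_pred_labels:  predictions of the tuned model on those same samples.
    """
    if len(bank_true_labels) != len(new_pred_labels):
        raise ValueError("label lists must have the same length")
    if not bank_true_labels:
        return 0.0  # Assumed convention: an empty bank has nothing to correct.
    corrected = sum(t == p for t, p in zip(bank_true_labels, new_pred_labels))
    return corrected / len(bank_true_labels)


# Toy example: four stored samples, three of which are fixed after tuning.
rate = correction_rate(
    ["terrier", "terrier", "beagle", "bulldog"],
    ["terrier", "bulldog", "beagle", "bulldog"],
)
print(f"{rate:.2%}")
```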