CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering

Yuyang Hong, Jiaqi Gu, Yujin Lou, Lubin Fan, Qi Yang, Ying Wang, Kun Ding, Yue Wu, Shiming Xiang, Jieping Ye

TL;DR

CC-VQA is a novel training-free, conflict- and correlation-aware method for KB-VQA that achieves state-of-the-art performance, with absolute accuracy improvements of 3.3% to 6.4% over existing methods.

Abstract

Knowledge-based visual question answering (KB-VQA) demonstrates significant potential for handling knowledge-intensive tasks. However, because the parametric knowledge of vision language models (VLMs) is fixed at pre-training, it can conflict with dynamically retrieved information. As a result, model outputs either ignore retrieved contexts or integrate them inconsistently with parametric knowledge, posing substantial challenges for KB-VQA. Current knowledge conflict mitigation methods are primarily adapted from language-based approaches and focus on context-level conflicts through engineered prompting strategies or context-aware decoding mechanisms. However, these methods neglect the critical role of visual information in conflicts and suffer from redundant retrieved contexts, which impair accurate conflict identification and effective mitigation. To address these limitations, we propose CC-VQA, a novel training-free, conflict- and correlation-aware method for KB-VQA. Our method comprises two core components: (1) Vision-Centric Contextual Conflict Reasoning, which performs visual-semantic conflict analysis across internal and external knowledge contexts; and (2) Correlation-Guided Encoding and Decoding, which compresses the positional encoding of low-correlation statements and applies adaptive decoding based on correlation-weighted conflict scoring. Extensive evaluations on the E-VQA, InfoSeek, and OK-VQA benchmarks demonstrate that CC-VQA achieves state-of-the-art performance, yielding absolute accuracy improvements of 3.3% to 6.4% compared to existing methods. Code is available at https://github.com/cqu-student/CC-VQA.

Paper Structure (28 sections, 9 equations, 13 figures, 12 tables)

Figures (13)

  • Figure 1: Example of Knowledge Conflict in KB-VQA. The introduction of external knowledge can enhance performance in knowledge-intensive visual question answering tasks; however, it concurrently introduces error risks stemming from conflicts between model parametric knowledge and external sources. Leveraging visual semantic features from both query image and contexts (highlighted in red) may mitigate knowledge conflicts and improve response accuracy in KB-VQA tasks.
  • Figure 2: Similarity Statistics Between Contextual Sentences and Question.
  • Figure 3: Overview of CC-VQA. CC-VQA consists of two key components. (1) Visual-Centric Contextual Conflict Reasoning: CC-VQA extracts semantic visual descriptions from both parametric and external contexts, then reasons about knowledge conflicts with the summarized visual-centric information. (2) Correlation-Guided Encoding and Decoding: By computing fine-grained correlations, CC-VQA dynamically compresses low-correlation information during positional encoding and adjusts the output distribution during decoding based on these correlations.
  • Figure 4: Illustration of Visual-Centric Contextual Conflict Reasoning. CC-VQA explicitly extracts the model’s parametric context and the visual rationale for all contexts. Through visual rationales, CC-VQA identifies and summarizes visual-centric conflicts $R_{vis}$ to address conflicting information.
  • Figure 5: The Case of CC-VQA. Qualitative results from Encyclopedic-VQA (top row) and InfoSeek (bottom row) illustrate knowledge conflicts introduced by retrieved information and show the effective mitigation achieved by our model.
  • ...and 8 more figures
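
The Correlation-Guided Encoding and Decoding component described in the abstract and in Figure 3 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper or its repository: the function names, the 0.5 threshold, the fractional position step for low-correlation sentences, and the contrastive logit adjustment, which is a generic context-aware-decoding form rather than the authors' formula.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def compressed_positions(sentence_lengths, correlations, threshold=0.5, low_step=0.5):
    """Assign one position per token, sentence by sentence.

    Tokens in sentences whose correlation with the question is below
    `threshold` advance the position counter by `low_step` instead of 1,
    so low-correlation text occupies a shorter positional span.
    The threshold and step values are hypothetical.
    """
    positions = []
    pos = 0.0
    for n_tokens, corr in zip(sentence_lengths, correlations):
        step = 1.0 if corr >= threshold else low_step
        for _ in range(n_tokens):
            positions.append(pos)
            pos += step
    return positions


def adjust_logits(with_ctx, without_ctx, correlation, conflict):
    """Contrast logits computed with and without the retrieved context.

    The contrast strength alpha is the product of a correlation score and
    a conflict score; this weighting is an assumed stand-in for the
    correlation-weighted conflict scoring named in the abstract.
    """
    alpha = correlation * conflict
    return [(1 + alpha) * a - alpha * b for a, b in zip(with_ctx, without_ctx)]


if __name__ == "__main__":
    # Three context sentences with 2, 2, and 1 tokens; the second one is
    # weakly correlated with the question, so its tokens are packed closer.
    print(compressed_positions([2, 2, 1], [0.9, 0.1, 0.8]))
    # A relevant, conflicting context pushes the output toward the context.
    print(adjust_logits([2.0, 1.0], [1.0, 1.0], correlation=0.5, conflict=1.0))
```

In this sketch, a weakly correlated sentence still contributes tokens but takes up less of the positional range, and the decoding adjustment grows only when a context is both relevant to the question and in conflict with the model's parametric answer.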