REAL: Resolving Knowledge Conflicts in Knowledge-Intensive Visual Question Answering via Reasoning-Pivot Alignment

Kai Ye, Xianwei Mao, Sheng Zhou, Zirui Shao, Ye Mo, Liangliang Liu, Haikuan Huang, Bin Li, Jiajun Bu

TL;DR

This paper tackles knowledge conflicts in knowledge-intensive visual question answering by introducing Reasoning-Pivot Alignment (REAL), a pivot-centric framework. It defines Reasoning-Pivots as indispensable units in multi-hop reasoning and formalizes pivot-specific conflicts, then couples pivot-aware supervision (RPA-SFT) with a training-free pivot-guided decoding (RPGD) to detect and mitigate conflicts. A dedicated REAL-VQA dataset supports fine-grained pivot annotations and conflict generation anchored to reliable Wikipedia contexts. Empirical results show improved conflict discrimination and state-of-the-art KI-VQA performance across benchmarks, with robust cross-domain generalization and a favorable latency-accuracy trade-off. This pivot-driven approach offers a principled path to reliable multimodal reasoning amidst retrieval noise.

Abstract

Knowledge-intensive Visual Question Answering (KI-VQA) frequently suffers from severe knowledge conflicts caused by the inherent limitations of open-domain retrieval. However, existing paradigms face critical limitations due to the lack of generalizable conflict detection and intra-model constraint mechanisms to handle conflicting evidence. To address these challenges, we propose the REAL (Reasoning-Pivot Alignment) framework centered on the novel concept of the Reasoning-Pivot. Distinct from reasoning steps that prioritize internal self-derivation, a reasoning-pivot serves as an atomic unit (node or edge) in the reasoning chain that emphasizes knowledge linkage, and it typically relies on external evidence to complete the reasoning. Supported by our constructed REAL-VQA dataset, our approach integrates Reasoning-Pivot Aware SFT (RPA-SFT) to train a generalizable discriminator by aligning conflicts with pivot extraction, and employs Reasoning-Pivot Guided Decoding (RPGD), an intra-model decoding strategy that leverages these pivots for targeted conflict mitigation. Extensive experiments across diverse benchmarks demonstrate that REAL significantly enhances discrimination accuracy and achieves state-of-the-art performance, validating the effectiveness of our pivot-driven resolution paradigm.

Paper Structure (39 sections, 5 equations, 8 figures, 11 tables, 1 algorithm)

Figures (8)

  • Figure 1: Comparison of Conflict Definitions. Conventional methods (Left) incorrectly flag irrelevant entity or keyword variations as conflicts, while the Reasoning-Pivot definition (Right) correctly distinguishes irrelevant location information from nationality information and only detects conflicts within the nationality pivot, treating unrelated locations as non-conflicting noise.
  • Figure 2: The proposed framework for KI-VQA. (1) Data Processing augments the REAL-VQA training set by inserting special tokens, denoted as <RPivot> and </RPivot>. (2) RPA-SFT fine-tunes the model with explicit reasoning-pivot awareness to guide the reasoning process. (3) RPGD employs a conflict-based contrastive decoding strategy to resolve ambiguities and ensure accurate reasoning.
  • Figure 3: Overview of the REAL-VQA data construction.
  • Figure 4: (Left) Accuracy vs. relative per-token latency for different decoding methods; (Right) Improvements of REAL on pivot-based QA accuracy on E-VQA.
  • Figure 5: Case study on E-VQA, comparing logits under RPGD and greedy decoding, showing that RPGD better focuses on conflict knowledge and yields the correct answer.
  • ...and 3 more figures
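
The data-processing step in the Figure 2 caption (inserting <RPivot> and </RPivot> around reasoning-pivots) could look roughly like the sketch below. This is only an illustration, not the authors' pipeline: the function name `mark_pivots`, the character-span input format, and the example sentence are all assumptions made here. Only the two tag strings come from the caption.

```python
from typing import List, Tuple

# Tag strings as given in the Figure 2 caption.
RPIVOT_OPEN = "<RPivot>"
RPIVOT_CLOSE = "</RPivot>"


def mark_pivots(text: str, spans: List[Tuple[int, int]]) -> str:
    """Wrap each (start, end) character span of `text` in pivot tags.

    Hypothetical helper: spans are assumed to be end-exclusive,
    non-empty, and non-overlapping.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor or end > len(text) or start >= end:
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        pieces.append(text[cursor:start])  # untouched text before the pivot
        pieces.append(RPIVOT_OPEN + text[start:end] + RPIVOT_CLOSE)
        cursor = end
    pieces.append(text[cursor:])  # untouched text after the last pivot
    return "".join(pieces)


# Made-up example sentence with a nationality-style pivot, in the spirit of Figure 1.
example = "The architect of this building was born in Austria."
pivot = "born in Austria"
start = example.index(pivot)
marked = mark_pivots(example, [(start, start + len(pivot))])
print(marked)
```

In a real fine-tuning setup the two tags would presumably also be registered as special tokens in the tokenizer so that each is encoded as a single unit. The source does not describe that step here.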