Implicit-Knowledge Visual Question Answering with Structured Reasoning Traces

Zhihao Wen, Wenkang Wei, Yuan Fang, Xingtong Yu, Hui Zhang, Weicheng Zhu, Xin Zhang

TL;DR

This work tackles implicit-knowledge KVQA (IK-KVQA) by introducing StaR-KVQA, which injects structured reasoning traces into a single-model, no-retrieval framework. It builds a dual-path planner over symbolic text and visual relations and a reasoning composer to generate path-grounded explanations, with offline trace selection forming an augmented training set. Fine-tuning via structure-aware self-distillation yields single-pass inference that reveals intermediate traces and improves accuracy on OK-VQA by up to $11.3\%$ over the strongest baseline, outperforming strong baselines including closed-source models. The approach demonstrates robust cross-domain generalization and enhanced interpretability, while acknowledging limitations in faithfulness guarantees and residual hallucination, with future work directed at verification modules and broader domain evaluation.

Abstract

Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. Recent work has introduced its implicit-knowledge variant, IK-KVQA, where a multimodal large language model (MLLM) is the sole knowledge source and answers are produced without external retrieval. Existing IK-KVQA approaches, however, are typically trained with answer-only supervision: reasoning remains implicit, justifications are often weak or inconsistent, and generalization after standard supervised fine-tuning (SFT) can be brittle. We propose StaR-KVQA, a framework that equips IK-KVQA with dual-path structured reasoning traces (symbolic relation paths over text and vision together with path-grounded natural-language explanations) to provide a stronger inductive bias than generic answer-only supervision. These traces act as modality-aware scaffolds that guide the model toward relevant entities and attributes, offering more structure than generic chain-of-thought supervision while not constraining reasoning to any single fixed path. Using a single open-source MLLM, StaR-KVQA constructs and selects traces to build an offline trace-enriched dataset and then performs structure-aware self-distillation; no external retrievers, verifiers, or curated knowledge bases are used, and inference is a single autoregressive pass. Across benchmarks, StaR-KVQA consistently improves both answer accuracy and the transparency of intermediate reasoning, achieving up to 11.3% higher answer accuracy on OK-VQA over the strongest baseline.

Paper Structure

This paper contains 16 sections, 8 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Traditional KVQA vs. implicit-knowledge KVQA (IK-KVQA). Traditional KVQA often relies on external knowledge sources (e.g., retrieval or KGs) on top of a perception backbone. In contrast, IK-KVQA retains the "K" to emphasize its knowledge-based nature while removing external sources: answers are predicted solely from $(I, Q)$ and parametric knowledge $f_\theta(I, Q)$. This stricter setting simplifies system design and improves scalability, but also raises the bar on how models acquire and utilize internal knowledge to support accurate predictions.
  • Figure 2: Overview of StaR-KVQA. Given a training image–text pair, a single $\mathrm{MLLM}_{\phi}$ generates multiple dual relation paths (a) and corresponding explanations (b). A selector (c) identifies the most consistent triplet, which, combined with the ground-truth answer, forms reasoning-augmented supervision (d). The fine-tuned $f_\theta'$ then performs single-pass inference (e), producing reasoning traces and answers without external knowledge. An example dual-path scaffold is shown, highlighting relevant visual attributes and semantic priors; paths need not be minimal or sufficient but guide the model toward useful evidence before composing a full explanation.
  • Figure 3: $K$, the number of candidate paths.
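The offline trace construction described in the Figure 2 caption (generate several candidate dual paths with explanations, select one, pair it with the ground-truth answer) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Trace`, `select_trace`, `build_example`, `propose`, and `score` are hypothetical, the target string layout is assumed, and the keyword-based scorer only stands in for whatever consistency criterion the selector actually uses. In a real pipeline, `propose` would call the MLLM; here it is a stub.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Trace:
    """One candidate: a textual relation path, a visual relation path, an explanation."""
    text_path: str
    visual_path: str
    explanation: str


def select_trace(candidates: List[Trace], answer: str,
                 score: Callable[[Trace, str], float]) -> Trace:
    """Return the highest-scoring candidate (the first one wins on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate trace")
    return max(candidates, key=lambda t: score(t, answer))


def build_example(image: object, question: str, answer: str,
                  propose: Callable[[object, str], Trace],
                  score: Callable[[Trace, str], float],
                  k: int = 3) -> Dict[str, str]:
    """Sample k candidate traces, keep the best, and attach the ground-truth answer."""
    candidates = [propose(image, question) for _ in range(k)]
    best = select_trace(candidates, answer, score)
    # Assumed layout of the reasoning-augmented training target.
    target = (
        f"Text path: {best.text_path}\n"
        f"Visual path: {best.visual_path}\n"
        f"Explanation: {best.explanation}\n"
        f"Answer: {answer}"
    )
    return {"question": question, "target": target}


# Stub proposer: replays a fixed pool instead of querying a model.
_pool = iter([
    Trace("umbrella -> used_for -> staying dry",
          "person -> holds -> umbrella",
          "Umbrellas keep people dry."),
    Trace("umbrella -> protects_from -> rain",
          "sky -> is -> overcast",
          "The overcast sky and the umbrella suggest rain."),
])


def stub_propose(image: object, question: str) -> Trace:
    return next(_pool)


def stub_score(trace: Trace, answer: str) -> float:
    # Placeholder criterion: does the explanation mention the answer?
    return float(answer.lower() in trace.explanation.lower())


example = build_example(None, "What weather is the person prepared for?",
                        "rain", stub_propose, stub_score, k=2)
print(example["target"])
```

The number of sampled candidates `k` plays the role of $K$ in Figure 3. Fine-tuning on targets of this shape is what would let the resulting model emit the trace and the answer in a single pass.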