Hybrid-DMKG: A Hybrid Reasoning Framework over Dynamic Multimodal Knowledge Graphs for Multimodal Multihop QA with Knowledge Editing

Li Yuan, Qingfei Huang, Bingshan Zhu, Yi Cai, Qingbao Huang, Changmeng Zheng, Zikun Deng, Tao Wang

TL;DR

This work tackles the challenge of multimodal multihop question answering with knowledge editing by introducing MMQAKE, a benchmark that emphasizes intermediate reasoning fidelity and robustness to visual variants. It proposes Hybrid-DMKG, a framework built on a dynamic multimodal knowledge graph that decomposes complex queries, retrieves cross-modal facts, and reasons along two parallel paths, with a reflective decision module that reconciles their outputs. Experimental results show that Hybrid-DMKG outperforms existing MKE baselines on MMQAKE across backbones and remains robust to edits and image rephrasings, with ablations confirming the critical roles of decomposition, retrieval, and the reflective component. The approach advances reliable multimodal reasoning under evolving knowledge and points toward future work on temporal knowledge updates and end-to-end multihop reasoning without fixed sub-questions.

Abstract

Multimodal Knowledge Editing (MKE) extends traditional knowledge editing to settings involving both textual and visual modalities. However, existing MKE benchmarks primarily assess final answer correctness while neglecting the quality of intermediate reasoning and robustness to visually rephrased inputs. To address this limitation, we introduce MMQAKE, the first benchmark for multimodal multihop question answering with knowledge editing. MMQAKE evaluates (1) a model's ability to reason over 2-5-hop factual chains that span both text and images, including performance at each intermediate step, and (2) robustness to visually rephrased inputs in multihop questions. Our evaluation shows that current MKE methods often struggle to consistently update and reason over multimodal reasoning chains after knowledge edits. To overcome these challenges, we propose Hybrid-DMKG, a hybrid reasoning framework built on a dynamic multimodal knowledge graph (DMKG) to enable accurate multihop reasoning over updated multimodal knowledge. Hybrid-DMKG first uses a large language model to decompose multimodal multihop questions into sequential sub-questions, then applies a multimodal retrieval model to locate updated facts by jointly encoding each sub-question with candidate entities and their associated images. For answer inference, a hybrid reasoning module operates over the DMKG via two parallel paths: (1) relation linking prediction, and (2) RAG reasoning with large vision-language models. A decision module aggregates evidence from both paths to select the most credible answer. Experimental results on MMQAKE show that Hybrid-DMKG significantly outperforms existing MKE approaches, achieving higher accuracy and improved robustness to knowledge updates.
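
The pipeline described in the abstract (decompose, retrieve, answer along two paths, then decide) can be illustrated with a toy sketch. This is not the authors' implementation: `ToyDMKG`, `toy_reader`, `decide`, and `answer_multihop` are hypothetical names, the question is assumed to be already decomposed into a list of relations, the starting entity is assumed to be already recognized from the image, retrieval is reduced to an exact match on the head entity, and the decision rule (prefer the graph answer, fall back to the reader) is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Triple = tuple[str, str, str]  # (head, relation, tail)


@dataclass
class ToyDMKG:
    """Hypothetical stand-in for a dynamic multimodal knowledge graph."""
    facts: dict[tuple[str, str], str] = field(default_factory=dict)

    def edit(self, head: str, relation: str, new_tail: str) -> None:
        # A knowledge edit inserts or overwrites the tail of (head, relation).
        self.facts[(head, relation)] = new_tail

    def link(self, head: str, relation: str) -> Optional[str]:
        # Path 1 (assumed form): answer by following a relation edge.
        return self.facts.get((head, relation))

    def retrieve(self, head: str) -> list[Triple]:
        # Retrieval, reduced here to an exact match on the head entity.
        return [(h, r, t) for (h, r), t in self.facts.items() if h == head]


def toy_reader(relation: str, facts: list[Triple]) -> Optional[str]:
    # Path 2 placeholder: a real system would prompt an LVLM with the
    # retrieved facts; this stub just picks the matching tail.
    for _, r, t in facts:
        if r == relation:
            return t
    return None


def decide(link_answer: Optional[str], rag_answer: Optional[str]) -> Optional[str]:
    # Assumed decision rule: prefer the graph answer, else the reader's answer.
    return link_answer if link_answer is not None else rag_answer


def answer_multihop(
    graph: ToyDMKG,
    start_entity: str,
    relations: list[str],
    reader: Callable[[str, list[Triple]], Optional[str]] = toy_reader,
) -> list[str]:
    """Return the answer at every hop, so intermediate steps can be scored."""
    entity, trace = start_entity, []
    for relation in relations:
        chosen = decide(graph.link(entity, relation),
                        reader(relation, graph.retrieve(entity)))
        if chosen is None:
            break  # the chain cannot be continued
        trace.append(chosen)
        entity = chosen  # the answer becomes the subject of the next hop
    return trace


graph = ToyDMKG()
graph.edit("Eiffel Tower", "located in", "Paris")
graph.edit("Paris", "capital of", "France")
graph.edit("Rome", "capital of", "Italy")
before = answer_multihop(graph, "Eiffel Tower", ["located in", "capital of"])

# A counterfactual edit to the first hop changes every later hop as well.
graph.edit("Eiffel Tower", "located in", "Rome")
after = answer_multihop(graph, "Eiffel Tower", ["located in", "capital of"])
```

Returning the full per-hop trace rather than only the last answer mirrors the benchmark's emphasis on intermediate steps: a model that reaches the right final answer through a stale first hop would be visible in the trace.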

Paper Structure

This paper contains 30 sections, 11 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: An example of our benchmark (MMQAKE), which differs in evaluation from existing MKE benchmarks.
  • Figure 2: Overall framework of Hybrid-DMKG for the MMQAKE task.
  • Figure 3: Performance comparison of different hops on MMQAKE using the original and rephrased input images.
  • Figure 4: A case study of Hybrid-DMKG solving a 3-hop question from MMQAKE. The phrase highlighted in blue (e.g., “located in”) represents the extracted relation keyword. “Linking” and “RAG” refer to the outputs of the Relation-Linking Prediction and RAG-Enhanced Reasoning modules within the LVLM, respectively. The “GT set” denotes the ground truth set.
  • Figure 5: Prompt template used for generating multihop questions in the MMQAKE task datasets.
  • ...and 3 more figures