MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs

Xueyao Wan, Hang Yu

TL;DR

This work proposes MMGraphRAG, which integrates visual scene graphs with text KGs via a novel cross-modal fusion approach, and introduces SpecLink, a method leveraging spectral clustering for accurate cross-modal entity linking and path-based retrieval to guide generation.

Abstract

Large Language Models (LLMs) often suffer from hallucinations, which Retrieval-Augmented Generation (RAG) and GraphRAG mitigate by incorporating external knowledge and knowledge graphs (KGs). However, GraphRAG remains text-centric due to the difficulty of constructing fine-grained Multimodal KGs (MMKGs). Existing fusion methods, such as shared embeddings or captioning, require task-specific training and fail to preserve visual structural knowledge or cross-modal reasoning paths. To bridge this gap, we propose MMGraphRAG, which integrates visual scene graphs with text KGs via a novel cross-modal fusion approach. It introduces SpecLink, a method leveraging spectral clustering for accurate cross-modal entity linking and path-based retrieval to guide generation. We also release the CMEL dataset, specifically designed for fine-grained multi-entity alignment in complex multimodal scenarios. Evaluations on CMEL, DocBench, and MMLongBench demonstrate that MMGraphRAG achieves state-of-the-art performance, showing robust domain adaptability and superior multimodal information processing capabilities.
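The abstract does not spell out how SpecLink works, so the following is only a minimal sketch of the general idea it names: spectral clustering over entity embeddings to narrow the set of text entities that a visual entity could link to. It is not the authors' implementation. The entity labels, the 2-D toy embeddings, and the helper `spectral_bipartition` are all hypothetical, and the two-way split by the sign of the second eigenvector is a simplification of general k-way spectral clustering.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def spectral_bipartition(embeddings, iterations=500):
    """Two-way spectral split (hypothetical helper, not SpecLink itself).

    Uses the sign of the second eigenvector of the normalized affinity
    matrix D^-1/2 W D^-1/2, which is the Fiedler vector of the normalized
    graph Laplacian. Assumes every node has positive degree.
    """
    n = len(embeddings)
    # Edge weights: non-negative cosine similarity, no self-loops.
    w = [[max(cosine(embeddings[i], embeddings[j]), 0.0) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in w]
    # Shifted matrix S = (I + D^-1/2 W D^-1/2) / 2 has eigenvalues in [0, 1],
    # so power iteration cannot be captured by a large negative eigenvalue.
    s = [[0.5 * ((1.0 if i == j else 0.0) + w[i][j] / math.sqrt(deg[i] * deg[j]))
          for j in range(n)] for i in range(n)]
    # The top eigenvector of S is known in closed form: sqrt(degree), normalized.
    norm = math.sqrt(sum(deg))
    top = [math.sqrt(d) / norm for d in deg]
    # Power iteration with the top eigenvector projected out at each step
    # converges to the second eigenvector.
    vec = [float(i + 1) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        proj = sum(a * b for a, b in zip(vec, top))
        vec = [a - proj * b for a, b in zip(vec, top)]
        vec = [sum(s[i][j] * vec[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in vec))
        vec = [a / length for a in vec]
    return [1 if a > 0 else 0 for a in vec]


# Toy data (assumed for illustration): four text entities and one image entity.
NAMES = ["text:Dr. Aris", "text:laboratory", "text:football",
         "text:club logo", "image:woman"]
EMBEDDINGS = [[1.0, 0.05], [0.98, 0.08], [0.05, 1.0], [0.08, 0.97], [0.99, 0.02]]

labels = spectral_bipartition(EMBEDDINGS)
query = NAMES.index("image:woman")
# Candidate text entities are those that share a cluster with the image entity.
candidates = [name for i, name in enumerate(NAMES)
              if i != query and labels[i] == labels[query]]
print(candidates)
```

In a sketch like this, clustering only shortlists candidates; a later step (for example, a model that inspects each shortlisted entity) would still have to pick the final link. The cluster ids themselves are arbitrary, so only co-membership is meaningful.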

Paper Structure

This paper contains 39 sections, 8 equations, 8 figures, 9 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Comparison of Image-Text Fusion Methods. Prior methods struggle with accurate visual reasoning: (a) Captioning-based methods linearize the image into a single text description, irretrievably losing fine-grained details. (b) Joint Extraction depends on precise annotations and fails in complex scenarios where key visual entities, like the logo itself, are unlabeled. (c) Shared Embedding MRAG struggles to isolate specific attributes from a flattened, implicit vector. In contrast, (d) our MMGraphRAG constructs a Multimodal Knowledge Graph, representing the logo and football as explicit visual nodes linked to text. This preserves the essential knowledge structure, enabling precise and interpretable answers.
  • Figure 2: MMGraphRAG Framework Overview. The framework begins by parsing sources such as novels and webpages, creating parallel knowledge graphs through Text2Graph for text and Image2Graph for visual content. The central Cross-Modal KG Fusion module then unifies these graphs by linking entities; for example, it uses the SpecLink method to align the textual mention of "Dr. Aris" with the corresponding "woman" in the image. This process yields a cohesive MMKG. Finally, during the Retrieval and Generation stage, this MMKG provides structured, cross-modal context to an MLLM/LLM system, enabling it to answer complex queries with accurate, well-supported responses.
  • Figure 3: An Example of the Img2Graph Module in Action
  • Figure 4: Entity Distribution Across Document Domains. The figure illustrates how entity types and quantities vary among news, academic papers, and novels, demonstrating the dataset's diversity and the challenge of cross-modal entity alignment.
  • Figure 5: Confusion matrices comparing human majority votes with Llama3.1-70B-Instruct as an automatic judge on (a) DocBench and (b) MMLongBench. The LLM judge exhibits strong alignment with human judgments across both datasets.
  • ...and 3 more figures