HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation

Pei Liu, Xin Liu, Ruoyu Yao, Junming Liu, Siyuan Meng, Ding Wang, Jun Ma

TL;DR

HM-RAG tackles complex multimodal queries with a hierarchical, three-tier multi-agent framework that decomposes questions, retrieves from vector, graph, and web sources in parallel, and reconciles the resulting answers through consistency voting and expert-model refinement. Its multimodal knowledge pre-processing fuses text, images, and graphs into grounded knowledge representations via BLIP-2 and LightRAG, enabling cross-modal reasoning. On ScienceQA and CrisisMMD, HM-RAG achieves state-of-the-art zero-shot performance and substantial improvements over single-source RAG baselines, while its modular design supports adding new data modalities under data-governance constraints. Code is available at the project repository.

Abstract

While Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge, conventional single-agent RAG remains fundamentally limited in resolving complex queries that demand coordinated reasoning across heterogeneous data ecosystems. We present HM-RAG, a novel Hierarchical Multi-agent Multimodal RAG framework that pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data. The framework comprises a three-tiered architecture with specialized agents: a Decomposition Agent that dissects complex queries into contextually coherent sub-tasks via semantic-aware query rewriting and schema-guided context augmentation; Multi-source Retrieval Agents that carry out parallel, modality-specific retrieval using plug-and-play modules designed for vector, graph, and web-based databases; and a Decision Agent that uses consistency voting to integrate multi-source answers and resolves discrepancies in retrieval results through Expert Model Refinement. This architecture attains comprehensive query understanding by combining textual, graph-relational, and web-derived evidence, resulting in a 12.95% improvement in answer accuracy and a 3.56% improvement in question classification accuracy over baseline RAG systems on the ScienceQA and CrisisMMD benchmarks. Notably, HM-RAG establishes state-of-the-art results in zero-shot settings on both datasets. Its modular architecture supports seamless integration of new data modalities while maintaining strict data governance, marking a significant advance in addressing the critical challenges of multimodal reasoning and knowledge synthesis in RAG systems. Code is available at https://github.com/ocean-luna/HMRAG.
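The three-tier flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`decompose`, `retrieve_parallel`, `decide`, `hm_rag_sketch`), the string-splitting stand-in for LLM-based decomposition, the callable retrievers, and the strict-majority rule that triggers refinement are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical type: a retrieval agent maps a query string to a candidate answer.
Retriever = Callable[[str], str]


def decompose(question: str) -> List[str]:
    """Toy stand-in for the Decomposition Agent.

    The paper uses an LLM to rewrite and decompose the question; here we
    simply split on " and " so the sketch stays self-contained.
    """
    parts = [p.strip() for p in question.rstrip("?").split(" and ")]
    return [p + "?" for p in parts if p]


def retrieve_parallel(sub_queries: List[str],
                      retrievers: Dict[str, Retriever]) -> Dict[str, str]:
    """Run every source-specific retriever concurrently on the sub-queries."""
    joined = " ".join(sub_queries)

    def run(name: str):
        return name, retrievers[name](joined)

    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        return dict(pool.map(run, retrievers))


def decide(answers: Dict[str, str],
           refine: Callable[[Dict[str, str]], str]) -> str:
    """Consistency voting with a refinement fallback.

    Assumption: a strict majority wins outright; otherwise the (hypothetical)
    expert-model callback `refine` resolves the disagreement.
    """
    top_answer, votes = Counter(answers.values()).most_common(1)[0]
    if votes > len(answers) / 2:
        return top_answer
    return refine(answers)


def hm_rag_sketch(question: str,
                  retrievers: Dict[str, Retriever],
                  refine: Callable[[Dict[str, str]], str]) -> str:
    """Decompose, retrieve from all sources in parallel, then decide."""
    sub_queries = decompose(question)
    answers = retrieve_parallel(sub_queries, retrievers)
    return decide(answers, refine)
```

In a real system each entry in `retrievers` would wrap a vector, graph, or web backend, and `refine` would call an expert model; keeping both as plain callables reflects the plug-and-play modularity the abstract emphasizes.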

Paper Structure

This paper contains 23 sections, 14 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison of (a) single-agent single-modal RAG and (b) multi-agent multimodal RAG. The multi-agent multimodal RAG processes multimodal data by converting them into vector and graph databases. It leverages multi-source retrieval across vector, graph, and web-based databases, enabling more comprehensive and efficient information retrieval. This advanced approach allows the multi-agent multimodal RAG to achieve superior performance in handling complex queries and diverse data types, setting it apart from the more limited single-agent single-modal RAG.
  • Figure 2: Overview of HM-RAG. The multi-agent multimodal framework operates in three stages. First, the Decomposition Agent uses an LLM to rewrite and decompose the question into several sub-queries. Second, the Multi-source Retrieval Agent retrieves the top-k relevant documents from vector-, graph-, and web-based sources as needed. Finally, the Decision Agent applies a voting mechanism and a refinement process to generate the final answer.
  • Figure 3: Case Study: Comparison Between HM-RAG and the Baseline Methods (Vector-based, Graph-based, and Web-based Retrieval Agent).
  • Figure 4: Comparison on single-modal question answering.
  • Figure 5: Comparison on multimodal question answering.