M$^3$Prune: Hierarchical Communication Graph Pruning for Efficient Multi-Modal Multi-Agent Retrieval-Augmented Generation

Weizi Shao, Taolin Zhang, Zijie Zhou, Chen Chen, Chengyu Wang, Xiaofeng He

TL;DR

M$^3$Prune addresses the efficiency problem in multi-modal retrieval-augmented generation (mRAG) by hierarchically pruning a multi-agent communication graph. It first sparsifies intra-modal interactions within the textual and visual streams, then learns a pruned inter-modal topology to fuse cross-modal cues, following a progressive pruning schedule across rounds. The approach uses DAG sampling and Gumbel-Softmax for differentiable edge selection, plus a modality-alignment loss that aligns textual and visual semantics. With these components, the paper reports state-of-the-art results on ScienceQA, ViDoSeek, and MultimodalQA at reduced token usage. The framework is intended to support scalable, robust, and interpretable multi-modal reasoning in real-world mRAG systems.
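
To make the idea of differentiable edge selection over a DAG more concrete, here is a minimal sketch, not the authors' implementation. It assumes each candidate edge between agents has a learnable logit, restricts edges to pairs `i < j` so that the fixed agent order acts as a topological order, and draws a relaxed keep/drop decision per edge with the two-class Gumbel-Softmax. The function names, the temperature `tau`, and the 0.5 threshold are hypothetical choices made for illustration; in a real training setup the same computation would run inside an autograd framework so gradients can reach the logits.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def stable_sigmoid(x):
    """Logistic function that avoids overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_dag_edges(logits, tau=0.5, seed=0):
    """Return (soft, hard) edge masks for a DAG over len(logits) agents.

    logits[i][j] scores the edge from agent i to agent j. Only pairs with
    i < j are considered, so the fixed agent order is a topological order
    and the sampled graph cannot contain a cycle.
    """
    rng = random.Random(seed)
    n = len(logits)
    soft = [[0.0] * n for _ in range(n)]
    hard = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Two-class Gumbel-Softmax ("keep" vs "drop") reduces to a
            # sigmoid over the logit plus the difference of two Gumbel draws.
            noisy = logits[i][j] + gumbel_noise(rng) - gumbel_noise(rng)
            soft[i][j] = stable_sigmoid(noisy / tau)
            # Hypothetical discretization: keep the edge if it is more
            # likely kept than dropped under this sample.
            hard[i][j] = 1 if soft[i][j] > 0.5 else 0
    return soft, hard


# Example: three agents with hand-picked (illustrative) edge logits.
example_logits = [
    [0.0, 2.0, -2.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
soft_mask, hard_mask = sample_dag_edges(example_logits, tau=0.5, seed=0)
```

Lower values of `tau` push the soft mask toward 0 or 1, while higher values keep it smoother; the soft mask is what a gradient-based learner would use, and the hard mask is what would decide which messages are actually sent.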

Abstract

Recent advancements in multi-modal retrieval-augmented generation (mRAG), which enhance multi-modal large language models (MLLMs) with external knowledge, have demonstrated that the collective intelligence of multiple agents can significantly outperform a single model through effective communication. Despite impressive performance, existing multi-agent systems inherently incur substantial token overhead and increased computational costs, posing challenges for large-scale deployment. To address these issues, we propose a novel Multi-Modal Multi-agent hierarchical communication graph PRUNING framework, termed M$^3$Prune. Our framework eliminates redundant edges across different modalities, achieving an optimal balance between task performance and token overhead. Specifically, M$^3$Prune first applies intra-modal graph sparsification to textual and visual modalities, identifying the edges most critical for solving the task. Subsequently, we construct a dynamic communication topology using these key edges for inter-modal graph sparsification. Finally, we progressively prune redundant edges to obtain a more efficient and hierarchical topology. Extensive experiments on both general and domain-specific mRAG benchmarks demonstrate that our method consistently outperforms both single-agent and robust multi-agent mRAG systems while significantly reducing token consumption.
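
The progressive pruning step described in the abstract can be illustrated with a small sketch. It assumes each communication edge already carries a learned importance weight and that a fixed fraction of the lowest-weight remaining edges is removed in every round. The function name, the `drop_ratio` parameter, the agent labels, and the weights are all hypothetical; the paper's actual schedule and weight updates may differ.

```python
def progressive_prune(weights, rounds=3, drop_ratio=0.25):
    """Prune the lowest-weight edges a little more in every round.

    weights maps an edge (source, target) to its importance score.
    Returns a list with one dict of surviving edges per round.
    """
    remaining = dict(weights)
    history = []
    for _ in range(rounds):
        # Number of edges to remove this round; always keep at least one.
        n_drop = int(len(remaining) * drop_ratio)
        n_drop = max(0, min(n_drop, len(remaining) - 1))
        # Rank edges from least to most important and drop the weakest.
        ranked = sorted(remaining, key=remaining.get)
        for edge in ranked[:n_drop]:
            del remaining[edge]
        history.append(dict(remaining))
    return history


# Illustrative edge weights between hypothetical text and visual agents.
edge_weights = {
    ("text_0", "text_1"): 0.9,
    ("text_0", "vis_0"): 0.2,
    ("text_1", "vis_0"): 0.7,
    ("text_1", "vis_1"): 0.1,
    ("vis_0", "vis_1"): 0.8,
    ("vis_0", "text_2"): 0.4,
    ("vis_1", "text_2"): 0.3,
    ("text_2", "vis_2"): 0.6,
}
rounds_of_edges = progressive_prune(edge_weights, rounds=3, drop_ratio=0.25)
```

Removing edges gradually, rather than all at once, leaves room for the remaining weights to be re-estimated between rounds, which is the motivation the abstract gives for pruning progressively.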

Paper Structure

This paper contains 33 sections, 14 equations, 32 figures, 8 tables, 1 algorithm.

Figures (32)

  • Figure 1: Comparison of our approach with existing methods. (1) Closed-book Reasoning does not consider the need for external knowledge. (2) Single-agent mRAG leverages an end-to-end MLLM solution combined with a retriever to answer all questions. (3) Multi-agent mRAG constructs a fixed communication topology to obtain collaborative answers, regardless of communication efficiency. (4) Our Multi-agent Pruning approach dynamically prunes unnecessary edge connections to enhance response consistency.
  • Figure 2: Overview of M$^3$Prune. Key components include: (1) Intra-modal Graph Sparsification: analyzes the input question using the multi-agent structures of textual and visual modalities, respectively; (2) Inter-modal Graph Sparsification: supplements semantic information across modalities through the interaction of multi-agent viewpoints in textual and visual modalities; (3) Progressive Edge Pruning: prunes redundant edges in each round of the learning process. Due to the numerous connections between agents in the inter-modal stage, we illustrate the interaction of only one agent with dashed lines as an example.
  • Figure 3: Evolution of communication edge weights for visual-to-text (left) and text-to-visual (right) edges on ScienceQA.
  • Figure 4: Comparison of the trade-off between performance and token consumption for multi-agent models. The total token count is calculated as the sum of prompt tokens and completion tokens.
  • Figure 5: Performance under adversarial attacks, including input prompt and response perturbations on MultimodalQA.
  • ...and 27 more figures