M$^3$Prune: Hierarchical Communication Graph Pruning for Efficient Multi-Modal Multi-Agent Retrieval-Augmented Generation

Weizi Shao; Taolin Zhang; Zijie Zhou; Chen Chen; Chengyu Wang; Xiaofeng He

TL;DR

M$^3$Prune addresses the efficiency problem in multi-modal retrieval-augmented generation (mRAG) by hierarchically pruning a multi-agent communication graph. It first sparsifies intra-modal interactions within the textual and visual streams, then learns a pruned inter-modal topology to fuse cross-modal cues, following a progressive pruning schedule across rounds. The approach uses DAG sampling and Gumbel-Softmax for differentiable edge selection, plus a modality-alignment loss to harmonize textual and visual semantics. Together, these yield state-of-the-art results on ScienceQA, ViDoSeek, and MultimodalQA with reduced token usage. The framework is aimed at scalable, robust, and interpretable multi-modal reasoning in real-world mRAG systems.
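To make the idea of differentiable edge selection under a DAG constraint concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names, the `logits` matrix of per-edge scores, the temperature `tau`, and the choice to enforce acyclicity by only allowing edges from a lower-indexed agent to a higher-indexed one are all assumptions made for illustration.

```python
import math
import random


def gumbel_noise(rng):
    # Draw from Gumbel(0, 1) by inverse transform; clamp u to avoid log(0).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def sample_dag_edges(logits, tau=1.0, seed=0):
    """Relaxed keep/drop sampling for each candidate edge i -> j.

    logits[i][j] is a hypothetical learnable score for the edge i -> j.
    Returns (soft, hard): relaxed keep probabilities and a 0/1 mask.
    """
    rng = random.Random(seed)
    n = len(logits)
    soft = [[0.0] * n for _ in range(n)]
    hard = [[0] * n for _ in range(n)]
    for i in range(n):
        # Assumption: agents are in a fixed order and only i < j is allowed,
        # so the sampled graph cannot contain a cycle.
        for j in range(i + 1, n):
            # Two-class Gumbel-Softmax over {keep, drop}; drop logit fixed at 0.
            keep = (logits[i][j] + gumbel_noise(rng)) / tau
            drop = gumbel_noise(rng) / tau
            m = max(keep, drop)  # subtract the max for numerical stability
            e_keep, e_drop = math.exp(keep - m), math.exp(drop - m)
            soft[i][j] = e_keep / (e_keep + e_drop)
            hard[i][j] = 1 if soft[i][j] > 0.5 else 0
    return soft, hard
```

In a trainable version, `soft` would typically carry the gradient while `hard` decides which messages are actually exchanged; lowering `tau` pushes the relaxed values toward 0 or 1.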

Abstract

Recent advancements in multi-modal retrieval-augmented generation (mRAG), which enhance multi-modal large language models (MLLMs) with external knowledge, have demonstrated that the collective intelligence of multiple agents can significantly outperform a single model through effective communication. Despite impressive performance, existing multi-agent systems inherently incur substantial token overhead and increased computational costs, posing challenges for large-scale deployment. To address these issues, we propose a novel Multi-Modal Multi-agent hierarchical communication graph PRUNING framework, termed M$^3$Prune. Our framework eliminates redundant edges across different modalities, achieving an optimal balance between task performance and token overhead. Specifically, M$^3$Prune first applies intra-modal graph sparsification to textual and visual modalities, identifying the edges most critical for solving the task. Subsequently, we construct a dynamic communication topology using these key edges for inter-modal graph sparsification. Finally, we progressively prune redundant edges to obtain a more efficient and hierarchical topology. Extensive experiments on both general and domain-specific mRAG benchmarks demonstrate that our method consistently outperforms both single-agent and robust multi-agent mRAG systems while significantly reducing token consumption.
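The three stages described in the abstract (intra-modal sparsification, inter-modal sparsification over the surviving edges, and progressive pruning) can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's procedure: edges are tuples of hypothetical agent names, edge importance is given as a precomputed score, and "keep the top fraction by score" stands in for whatever learned selection the method actually uses.

```python
import math


def keep_top_edges(scores, keep_ratio):
    """Keep the highest-scoring fraction of edges (at least one if any exist)."""
    k = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {edge: scores[edge] for edge in ranked[:k]}


def hierarchical_prune(text_scores, visual_scores, cross_scores, keep_ratio=0.5):
    # Stage 1: sparsify each modality's communication graph independently.
    text_kept = keep_top_edges(text_scores, keep_ratio)
    visual_kept = keep_top_edges(visual_scores, keep_ratio)
    # Agents that still take part in at least one intra-modal edge.
    active = {agent for edge in list(text_kept) + list(visual_kept) for agent in edge}
    # Stage 2: only cross-modal edges between active agents are candidates.
    candidates = {
        edge: score
        for edge, score in cross_scores.items()
        if edge[0] in active and edge[1] in active
    }
    cross_kept = keep_top_edges(candidates, keep_ratio)
    return text_kept, visual_kept, cross_kept


def progressive_prune(scores, ratios):
    # Stage 3: apply a sequence of keep ratios, shrinking the graph each round.
    kept = dict(scores)
    for ratio in ratios:
        kept = keep_top_edges(kept, ratio)
    return kept
```

Restricting cross-modal candidates to agents that survived the first stage is one simple way to make the second stage depend on the first, which is what makes the resulting topology hierarchical in this sketch.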
