Wireless Agentic AI with Retrieval-Augmented Multimodal Semantic Perception

Guangyuan Liu, Yinqiu Liu, Ruichen Zhang, Hongyang Du, Dusit Niyato, Zehui Xiong, Sumei Sun, Abbas Jamalipour

TL;DR

RAMSemCom addresses the challenge of exchanging semantically rich multimodal information in bandwidth-limited wireless multi-agent systems. It combines retrieval-augmented perception with a semantic communication framework and a DRL-based scheduler to adaptively select and transmit only the most relevant multimodal content. The approach employs iterative semantic refinement, top-$k$ patch retrieval, and centralized scheduling to balance semantic fidelity against bandwidth usage. A case study in multi-agent autonomous driving demonstrates faster task completion and reduced communication overhead compared with baselines, underscoring practical value for real-time cooperative AI.

Abstract

The rapid development of multimodal AI and Large Language Models (LLMs) has greatly enhanced real-time interaction, decision-making, and collaborative tasks. However, in wireless multi-agent scenarios, limited bandwidth poses significant challenges to exchanging semantically rich multimodal information efficiently. Traditional semantic communication methods, though effective, struggle with redundancy and loss of crucial details. To overcome these challenges, we propose a Retrieval-Augmented Multimodal Semantic Communication (RAMSemCom) framework. RAMSemCom incorporates iterative, retrieval-driven semantic refinement tailored for distributed multi-agent environments, enabling efficient exchange of critical multimodal elements through local caching and selective transmission. Our approach dynamically optimizes retrieval using deep reinforcement learning (DRL) to balance semantic fidelity with bandwidth constraints. A comprehensive case study on multi-agent autonomous driving demonstrates that our DRL-based retrieval strategy significantly improves task completion efficiency and reduces communication overhead compared to baseline methods.
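
The selective transmission described above depends on ranking cached patches by their relevance to a task query. The following is a minimal sketch of top-$k$ retrieval by cosine similarity, not the authors' implementation. The function names, the two-dimensional toy embeddings, and the use of cosine similarity as the relevance score are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_patches(
    query: Sequence[float],
    patch_embeddings: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (patch_index, score) for the k patches most similar to the query."""
    scored = [
        (index, cosine_similarity(query, embedding))
        for index, embedding in enumerate(patch_embeddings)
    ]
    # Highest similarity first; ties keep their original patch order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[: max(k, 0)]


# Hypothetical toy data: four cached patch embeddings and one query embedding.
patches = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [1.0, 0.1]

selected = top_k_patches(query, patches, k=2)
selected_ids = [index for index, _ in selected]
print(selected_ids)
```

With $k=0$ the function returns an empty list, which corresponds to the NoRetrieval baseline in which no patches are requested.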

Paper Structure

This paper contains 17 sections and 5 figures.

Figures (5)

  • Figure 1: Multi-agent Frameworks: Survey and summary of popular multi-agent frameworks, emphasizing supported modalities, last update status, and notable applications. Multimodal Data in Multi-agent Scene: Illustration of multimodal multi-agent system scenarios demonstrating three common communication patterns: (1) Intradomain inter-agent vehicle coordination, (2) Intradomain intra-agent planning and collaboration (e.g., robot collaboration scenario by Helix AI at https://www.figure.ai/news/helix), and (3) Interdomain inter-agent logistics and transportation coordination.
  • Figure 2: Detailed hierarchical architecture of the RAMSemCom framework, highlighting its distinct layers: Data Level, RAMSemCom Level, Semantic Level, and Physical Level. The figure illustrates the complete process, from semantic data encoding through iterative retrieval requests based on semantic uncertainty to channel-aware retrieval scheduling. Each layer's function is outlined, emphasizing the role of dynamic retrieval scheduling and iterative refinement in minimizing bandwidth usage and ensuring effective semantic interpretation.
  • Figure 3: Visualization of the retrieval-refinement process within RAMSemCom using a representative autonomous driving scenario. Initially, a low-resolution semantic summary of the scene is transmitted (A, B). Agents evaluate task-specific queries (e.g., traffic conditions ahead) and perform top-$k$ semantic similarity searches to identify relevant patches (C). A DRL-based scheduler dynamically allocates retrieval requests considering real-time channel conditions and semantic priorities (D). Iterative retrieval rounds progressively refine agents' semantic perception, leading to accurate decisions such as rerouting to avoid traffic congestion (E).
  • Figure 4: Comparison of reward performance for different patch retrieval strategies over training episodes. The DRL-based methods (PPO and DQN) significantly outperform fixed baselines, demonstrating their effectiveness in dynamically allocating semantic retrieval bandwidth. The NoRetrieval line ($k=0$) serves as a performance floor, while heuristic and random $k$ strategies show limited adaptability in complex task scenarios.
  • Figure 5: Cumulative number of successfully completed QA tasks as a function of transmission round. DRL-based methods (PPO and DQN) converge significantly faster than heuristic and non-learning baselines, highlighting the effectiveness of adaptive semantic patch allocation under bandwidth constraints.
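
The retrieval-refinement loop summarized in the Figure 3 caption can be sketched as a toy simulation. This is an illustrative sketch under stated assumptions, not the paper's method: `heuristic_k` is a hypothetical hand-written rule standing in for the learned DRL scheduler, patches are assumed to be pre-ranked by relevance, and "confidence" is modeled simply as the fraction of total relevance received so far.

```python
import math
from typing import List, Tuple


def heuristic_k(channel_capacity: int, uncertainty: float, k_max: int = 4) -> int:
    """Hypothetical stand-in for the scheduler: request more patches when
    uncertainty is high, capped by what the channel can carry this round."""
    wanted = math.ceil(uncertainty * k_max)
    return max(0, min(wanted, channel_capacity, k_max))


def refine(
    ranked_relevance: List[int],
    capacities: List[int],
    threshold: float = 0.85,
) -> Tuple[int, List[int], float]:
    """Simulate iterative retrieval rounds.

    ranked_relevance: assumed relevance weight of each patch, best first.
    capacities: number of patches the channel can carry in each round.
    Returns (rounds_used, received_patch_ids, final_confidence).
    """
    total = sum(ranked_relevance)
    if total == 0:
        return 0, [], 0.0

    received: List[int] = []
    confidence = 0.0
    rounds_used = 0
    for capacity in capacities:
        # Stop once the receiver is confident enough or nothing is left to send.
        if confidence >= threshold or len(received) == len(ranked_relevance):
            break
        rounds_used += 1
        k = heuristic_k(capacity, 1.0 - confidence)
        start = len(received)
        end = min(start + k, len(ranked_relevance))
        received.extend(range(start, end))
        confidence = sum(ranked_relevance[i] for i in received) / total
    return rounds_used, received, confidence


# Hypothetical scenario: four ranked patches and a channel that carries
# one patch in the first round and two patches in later rounds.
rounds_used, received_ids, confidence = refine([4, 3, 2, 1], [1, 2, 2])
print(rounds_used, received_ids, confidence)
```

In this toy run the first round is limited by the channel to a single patch, and the second round brings the confidence above the threshold, so the least relevant patch is never transmitted. A learned policy such as PPO or DQN would replace `heuristic_k` with a choice of $k$ optimized against observed rewards.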