CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction

Xiaopan Zhang, Zejin Wang, Zhixu Li, Jianpeng Yao, Jiachen Li

TL;DR

CommCP addresses cooperative information gathering in multi-agent multi-task Embodied Question Answering (MM-EQA) with a decentralized LLM-based communication framework that uses conformal prediction to calibrate message confidence and reduce receiver distractions. Robots share only information that is confidently relevant, which improves exploration efficiency and task success in photo-realistic HM3D scenarios. Experiments show that calibrated communication outperforms baselines across metrics, particularly in larger environments and under varying communication latencies, and that the approach scales to more complex multi-robot deployments. The work offers a practical strategy for multi-robot collaboration in human-guided embodied tasks.

Abstract

To complete assignments provided by humans in natural language, robots must interpret commands, generate and answer relevant questions for scene understanding, and manipulate target objects. Real-world deployments often require multiple heterogeneous robots with different manipulation capabilities to handle different assignments cooperatively. Beyond the need for specialized manipulation skills, effective information gathering is important in completing these assignments. To address this component of the problem, we formalize the information-gathering process in a fully cooperative setting as an underexplored multi-agent multi-task Embodied Question Answering (MM-EQA) problem, which is a novel extension of canonical Embodied Question Answering (EQA), where effective communication is crucial for coordinating efforts without redundancy. To address this problem, we propose CommCP, a novel LLM-based decentralized communication framework designed for MM-EQA. Our framework employs conformal prediction to calibrate the generated messages, thereby minimizing receiver distractions and enhancing communication reliability. To evaluate our framework, we introduce an MM-EQA benchmark featuring diverse, photo-realistic household scenarios with embodied questions. Experimental results demonstrate that CommCP significantly enhances the task success rate and exploration efficiency over baselines. The experiment videos, code, and dataset are available on our project website: https://comm-cp.github.io.
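To make the calibration step concrete, here is a minimal sketch of standard split conformal prediction applied to message filtering. It is not the authors' implementation: the nonconformity score (1 minus an LLM-reported relevance confidence), the function names `conformal_threshold` and `prediction_set`, and all numbers are assumptions chosen for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores.

    cal_scores: scores on held-out calibration examples, assumed here to be
                1 - (confidence assigned to the correct object).
    alpha:      target miscoverage rate (0.25 means roughly 75% coverage).
    """
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: keep every candidate.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(confidences, qhat):
    """Keep candidate objects whose nonconformity score is within qhat."""
    return [obj for obj, p in confidences.items() if 1.0 - p <= qhat]


# Hypothetical calibration scores and per-object confidences.
qhat = conformal_threshold([0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.90], alpha=0.25)
candidates = {"red pillow on blue chair": 0.70, "bed": 0.45, "black chair": 0.10}
to_send = prediction_set(candidates, qhat)
print(qhat, to_send)
```

In this sketch only the objects in `to_send` would be mentioned in the outgoing message, so low-confidence candidates such as "black chair" are dropped before they can distract the receiver.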

Paper Structure

This paper contains 17 sections, 5 equations, and 5 figures.

Figures (5)

  • Figure 1: In a household setting, robots exchange observations and reasoning to collaboratively complete their assigned tasks. Each agent generates confident and goal-directed messages using calibrated outputs from LLMs. The bottom-left image shows a bird’s-eye view of Robot 1’s navigation path after incorporating information received from Robot 2. The top-right sequence captures both robots’ camera views at different timestamps.
  • Figure 2: Overview of our framework. Each robot has a perception module, a communication module, a planning module, and a confidence check module. At each time step, a robot generates local and global semantic values (SV) based on its current view and the communication message from the other robot. It navigates using a 2D weighted semantic value map and handles related object-check requests from other agents. Messages are generated from the robot's current view and calibrated by conformal prediction to enhance relevance.
  • Figure 3: SC vs. NTC on our MM-EQA dataset. (a)–(d) show results for a 2-robot team. (a) Comparison between our method and baselines. (b) Ablation of the communication and conformal prediction modules. (c) Ablation of the number of objects in the communication messages and of the answer-sharing mechanism. (d) Comparison of baselines with our method at message-sending speeds of 0.25, 0.5, 1, 2, and 4 messages per second. (e) Scalability analysis comparing our method and baselines with a 3-robot team.
  • Figure 4: Comparison of robot views and global SV maps among three methods. The red points mark the location of the target and the green points mark the position of the robot. Agents in the three methods start from the same pose. The question for this scenario is "Where is the red bear cushion?" For "Ours-No-CP" and "Ours", Robot 2 explores the same rooms at different times and sends messages to Robot 1. The detailed messages are as follows: $\text{MSG}_{1}$: I see a basketboard, dolls, black chair that may be relevant to your target red bear cushion, and dolls may be your target at $\{position 1\}$. $\text{MSG}_{2}$: I see dolls that may be relevant to your target red bear cushion at $\{position 2\}$. $\text{MSG}_{3}$: I see bed, red pillow on blue chair that may be relevant to your target red bear cushion, and red pillow on blue chair may be your target at $\{position 3\}$. $\text{MSG}_{4}$: I see red pillow on blue chair that may be relevant to your target red bear cushion, and a red pillow on blue chair may be your target at $\{position 4\}$.
  • Figure 5: Performance improvement across environments of different sizes. "Advantage" is the difference between the NTC of "Ours" and the NTC of MMFBE, calculated as $\text{Advantage} = \text{NTC}_{\text{Ours}} - \text{NTC}_{\text{MMFBE}}$. Size 1 denotes a scene area $L \times W < 150 \, \text{m}^2$, Size 2 denotes $150 \leq L \times W < 250 \, \text{m}^2$, and Size 3 denotes $L \times W \geq 250 \, \text{m}^2$.
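
The size classes and the Advantage quantity defined in the Figure 5 caption can be sketched as follows. The thresholds and the formula come from the caption; the function names, the episode record layout, and the NTC numbers are hypothetical placeholders, not results from the paper.

```python
from collections import defaultdict


def size_class(length_m, width_m):
    """Map a scene's L x W footprint (in metres) to size class 1, 2, or 3."""
    area = length_m * width_m
    if area < 150:
        return 1
    if area < 250:
        return 2
    return 3


def advantage(ntc_ours, ntc_mmfbe):
    """Advantage = NTC_Ours - NTC_MMFBE, as defined in the caption."""
    return ntc_ours - ntc_mmfbe


def mean_advantage_by_size(episodes):
    """Average the Advantage within each size class."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ep in episodes:
        cls = size_class(ep["length_m"], ep["width_m"])
        sums[cls] += advantage(ep["ntc_ours"], ep["ntc_mmfbe"])
        counts[cls] += 1
    return {cls: sums[cls] / counts[cls] for cls in sorted(sums)}


# Placeholder episodes; the NTC values are made up for illustration only.
episodes = [
    {"length_m": 10, "width_m": 14, "ntc_ours": 0.50, "ntc_mmfbe": 0.75},
    {"length_m": 10, "width_m": 15, "ntc_ours": 0.25, "ntc_mmfbe": 0.75},
    {"length_m": 25, "width_m": 10, "ntc_ours": 0.25, "ntc_mmfbe": 1.00},
    {"length_m": 30, "width_m": 10, "ntc_ours": 0.50, "ntc_mmfbe": 0.75},
]
print(mean_advantage_by_size(episodes))
```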