GauDP: Reinventing Multi-Agent Collaboration through Gaussian-Image Synergy in Diffusion Policies

Ziye Wang, Li Kang, Yiran Qin, Jiahua Ma, Zhanglin Peng, Lei Bai, Ruimao Zhang

TL;DR

GauDP tackles multi-agent collaboration by building a globally consistent 3D Gaussian field from multi-view RGB inputs and redistributing that context to individual agents, so each agent acts with awareness of the whole scene while preserving local control. It uses NoPoSplat, a feed-forward variant of 3D Gaussian Splatting, to reconstruct the global context from sparse, unposed views, supervised by depth cues through a reconstruction loss $\mathcal{L}_{\text{rec}} = \mathcal{L}_{\text{rgb}} + \alpha \mathcal{L}_{\text{depth}}$. The global context is selectively dispatched back to agents as view-specific Gaussians and fused with local features through a pixel-aligned fusion module, with a diffusion-policy backbone conditioned on $(\boldsymbol{O}, \boldsymbol{G})$. Empirically, GauDP outperforms image-based diffusion baselines and rivals point-cloud-based methods on RoboFactory tasks, while scaling better to more agents and remaining robust with RGB input alone.
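
As a rough illustration of how the reconstruction loss $\mathcal{L}_{\text{rec}}$ combines its two terms, the sketch below computes a weighted sum of an RGB term and a depth term over flattened pixel values. The summary does not state the exact form of either term or the value of $\alpha$, so the choices here (mean squared error for RGB, mean absolute error for depth, `alpha=0.5`) and all function names are assumptions made for illustration only.

```python
from typing import Sequence


def _check(pred: Sequence[float], target: Sequence[float]) -> None:
    """Reject empty or mismatched inputs before averaging."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and equally sized")


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Assumed RGB term: mean squared error over flattened pixel values."""
    _check(pred, target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mean_absolute_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Assumed depth term: mean absolute error over flattened depth values."""
    _check(pred, target)
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def reconstruction_loss(
    rendered_rgb: Sequence[float],
    input_rgb: Sequence[float],
    rendered_depth: Sequence[float],
    reference_depth: Sequence[float],
    alpha: float = 0.5,  # hypothetical weight; the source does not give a value
) -> float:
    """Weighted sum L_rec = L_rgb + alpha * L_depth."""
    l_rgb = mean_squared_error(rendered_rgb, input_rgb)
    l_depth = mean_absolute_error(rendered_depth, reference_depth)
    return l_rgb + alpha * l_depth
```

Here `rendered_rgb` and `rendered_depth` stand for values rendered from the Gaussian field for one view, and `input_rgb` and `reference_depth` stand for the corresponding input image and depth cue. In practice these would be image tensors rather than flat Python sequences.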

Abstract

Effective coordination in embodied multi-agent systems remains a fundamental challenge, particularly in scenarios where agents must balance individual perspectives with global environmental awareness. Existing approaches often struggle to combine fine-grained local control with comprehensive scene understanding, resulting in limited scalability and compromised collaboration quality. In this paper, we present GauDP, a novel Gaussian-image synergistic representation that facilitates scalable, perception-aware imitation learning in multi-agent collaborative systems. Specifically, GauDP constructs a globally consistent 3D Gaussian field from decentralized RGB observations, then dynamically redistributes 3D Gaussian attributes to each agent's local perspective. This enables all agents to adaptively query task-critical features from the shared scene representation while maintaining their individual viewpoints. This design facilitates both fine-grained control and globally coherent behavior without requiring additional sensing modalities (e.g., 3D point clouds). We evaluate GauDP on the RoboFactory benchmark, which includes diverse multi-arm manipulation tasks. Our method achieves superior performance over existing image-based methods and approaches the effectiveness of point-cloud-driven methods, while maintaining strong scalability as the number of agents increases.

Paper Structure

This paper contains 31 sections, 9 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Both local and global context are essential in multi-agent collaboration. Comparison of multi-agent decision-making using different types of contextual information. (a) Using only local context leads to miscoordination, such as collisions, due to the lack of global awareness. (b) Using only global context provides a holistic scene view but lacks detailed local features, resulting in inaccurate control, such as biased localization. (c) Our proposed method, GauDP, reconstructs global context from 2D local images via a shared 3D Gaussian representation and fuses it on top of local observations. This integration, based solely on 2D observations, enables both accurate localization and coordinated execution.
  • Figure 2: (a) Overview of the proposed GauDP framework for multi-agent imitation learning. Each agent extracts a local context from its 2D observation. A shared 3D Gaussian field is constructed from all views to form the global context, which is fused with the local context and passed through an encoder. The resulting per-agent features are processed by a diffusion policy via cross-attention to predict actions. (b) Pipeline for constructing the global Gaussian field. Multi-view images are encoded and aggregated via cross-attention, followed by a reconstruction loss $\mathcal{L}_{\text{rec}}$ between rendered and input views to ensure consistency.
  • Figure 3: Visualization of Reconstruction Results. Our method achieves significantly improved reconstruction quality.