GauDP: Reinventing Multi-Agent Collaboration through Gaussian-Image Synergy in Diffusion Policies
Ziye Wang, Li Kang, Yiran Qin, Jiahua Ma, Zhanglin Peng, Lei Bai, Ruimao Zhang
TL;DR
GauDP tackles multi-agent collaboration by building a globally consistent 3D Gaussian field from multi-view RGB inputs and redistributing that context to individual agents, so that each agent acts with awareness of the whole scene while preserving local control. It uses NoPoSplat, a feed-forward variant of 3D Gaussian Splatting, to reconstruct the global context from sparse, unposed views, supervised with depth cues through a reconstruction loss $L_{rec} = L_{rgb} + \alpha L_{depth}$. The global context is selectively dispatched back to agents as view-specific Gaussians and fused with local features through a pixel-aligned fusion module, and the diffusion-policy backbone is conditioned on $(\boldsymbol{O}, \boldsymbol{G})$. Empirically, GauDP outperforms image-based diffusion baselines and approaches point-cloud-based methods on RoboFactory tasks, scaling better to more agents while using RGB input alone.
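The reconstruction loss above can be illustrated with a minimal sketch. The summary does not specify the form of the individual terms, so the mean squared error for $L_{rgb}$, the mean absolute error for $L_{depth}$, the default `alpha=0.5`, and all function names here are assumptions made for illustration, not the authors' implementation.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def rgb_loss(pred, target):
    """Assumed form of L_rgb: mean squared error over flattened RGB values."""
    return mean((p - t) ** 2 for p, t in zip(pred, target))


def depth_loss(pred, target):
    """Assumed form of L_depth: mean absolute error over depth values."""
    return mean(abs(p - t) for p, t in zip(pred, target))


def reconstruction_loss(rgb_pred, rgb_gt, depth_pred, depth_gt, alpha=0.5):
    """L_rec = L_rgb + alpha * L_depth; alpha=0.5 is a placeholder weight."""
    return rgb_loss(rgb_pred, rgb_gt) + alpha * depth_loss(depth_pred, depth_gt)


# Toy values: two rendered RGB samples and one rendered depth sample.
loss = reconstruction_loss([0.0, 1.0], [0.0, 0.0], [2.0], [1.0], alpha=0.5)
```

The weight `alpha` trades off photometric fidelity against geometric accuracy: a larger value pushes the Gaussian field to match the depth supervision more closely.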
Abstract
Effective coordination in embodied multi-agent systems remains a fundamental challenge, particularly in scenarios where agents must balance individual perspectives with global environmental awareness. Existing approaches often struggle to reconcile fine-grained local control with comprehensive scene understanding, resulting in limited scalability and compromised collaboration quality. In this paper, we present GauDP, a novel Gaussian-image synergistic representation that facilitates scalable, perception-aware imitation learning in multi-agent collaborative systems. Specifically, GauDP constructs a globally consistent 3D Gaussian field from decentralized RGB observations, then dynamically redistributes 3D Gaussian attributes to each agent's local perspective. This enables all agents to adaptively query task-critical features from the shared scene representation while maintaining their individual viewpoints. The design supports both fine-grained control and globally coherent behavior without requiring additional sensing modalities (e.g., 3D point clouds). We evaluate GauDP on the RoboFactory benchmark, which includes diverse multi-arm manipulation tasks. Our method outperforms existing image-based methods and approaches the effectiveness of point-cloud-driven methods, while maintaining strong scalability as the number of agents increases.
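The build-then-redistribute idea can be sketched in a few lines. This is a toy illustration under stated assumptions: each view is assumed to contribute one Gaussian per pixel, dispatching is assumed to mean selecting the Gaussians that originated from a given view, and fusion is assumed to be per-pixel concatenation. The function names (`build_global_field`, `dispatch`, `fuse`) and the data layout are hypothetical and do not come from the paper.

```python
def build_global_field(per_view_gaussians):
    """Merge per-view, pixel-aligned Gaussians into one shared list.

    Each entry records the source view and pixel index (assumed bookkeeping).
    """
    field = []
    for view_id, gaussians in enumerate(per_view_gaussians):
        for pixel_idx, attrs in enumerate(gaussians):
            field.append({"view": view_id, "pixel": pixel_idx, "attrs": attrs})
    return field


def dispatch(field, view_id):
    """Return the Gaussian attributes belonging to one view, in pixel order."""
    selected = [g for g in field if g["view"] == view_id]
    selected.sort(key=lambda g: g["pixel"])
    return [g["attrs"] for g in selected]


def fuse(local_features, view_gaussians):
    """Pixel-aligned fusion, here assumed to be per-pixel concatenation."""
    return [f + g for f, g in zip(local_features, view_gaussians)]


# Two agents, two pixels each, one Gaussian attribute per pixel (toy sizes).
per_view = [[[1.0], [2.0]], [[3.0], [4.0]]]
field = build_global_field(per_view)
fused = fuse([[0.1], [0.2]], dispatch(field, 1))
```

In a real system the attributes in the shared field would be refined jointly across views before being dispatched, which is what lets each agent's local features carry global scene information.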
