Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning

Heng Zhou, Li Kang, Yiran Qin, Xiufeng Song, Ao Yu, Zilu Zhang, Haoming Song, Kaixin Xu, Yuchen Fan, Dongzhan Zhou, Xiaohong Liu, Ruimao Zhang, Philip Torr, Lei Bai, Zhenfei Yin

Abstract

Understanding the world from distributed, partial viewpoints is a fundamental challenge for embodied multi-agent systems. Each agent perceives the environment through an ego-centric view that is often limited by occlusion and ambiguity. To study this problem, we introduce the Ego-to-World (E2W) benchmark, which evaluates a vision-language model's ability to fuse heterogeneous viewpoints across three tasks: (i) global counting, (ii) relational location reasoning, and (iii) action-oriented grasping that requires predicting view-specific image coordinates. To address this setting, we propose CoRL, a two-stage framework that combines Chain-of-Thought supervised fine-tuning with reinforcement learning using Group-Relative Policy Optimization. Its core component, the Cross-View Spatial Reward (CVSR), provides dense task-aligned feedback by linking reasoning steps to visual evidence, ensuring coherent cross-view entity resolution, and guiding the model toward correct final predictions. Experiments on E2W show that CoRL consistently surpasses strong proprietary and open-source baselines on both reasoning and perception-grounding metrics, while ablations further confirm the necessity of each CVSR component. Beyond that, CoRL generalizes to external spatial reasoning benchmarks and enables effective real-world multi-robot manipulation with calibrated multi-camera rigs, demonstrating cross-view localization and successful grasp-and-place execution. Together, E2W and CoRL provide a principled foundation for learning world-centric scene understanding from distributed, ego-centric observations, advancing collaborative embodied AI.
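
As a rough illustration of the reinforcement-learning stage, the sketch below shows the group-relative advantage normalization commonly used in Group-Relative Policy Optimization: several responses are sampled for the same prompt, each is scored, and every score is standardized against its own group. The abstract does not specify how rewards are combined, so `combined_reward`, its weights, and the sample scores are hypothetical placeholders, not the paper's CVSR definition.

```python
from statistics import mean, pstdev


def combined_reward(format_ok, spatial_score, w_format=0.5, w_spatial=1.0):
    """Hypothetical scalar reward: a format bonus plus a spatial score in [0, 1].

    The weights are illustrative assumptions, not values from the paper.
    """
    return w_format * (1.0 if format_ok else 0.0) + w_spatial * spatial_score


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one multi-view prompt (made-up scores).
samples = [(True, 1.0), (True, 0.0), (False, 0.5), (True, 0.5)]
rewards = [combined_reward(ok, score) for ok, score in samples]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and those below receive negative ones, so no separate learned value function is needed to form the baseline.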

Paper Structure (59 sections, 15 equations, 7 figures, 5 tables, 1 algorithm)

Figures (7)

  • Figure 1: An illustration of collaborative spatial reasoning in embodied systems. Reasoning from a single viewpoint fails due to occlusions or a limited field of view. In contrast, cross-view compositional reasoning integrates multiple perspectives to correctly localize and grasp the target object, here the blue block farthest from the strawberry.
  • Figure 2: Overview of the Ego-to-World (E2W) Benchmark. Top: Multiple agents (Robot A, B, C) each provide partial ego-centric views of a shared scene. The vision-language model trained with our CoRL framework integrates these complementary perspectives to solve three tasks: Counting (E2W-1), Location Reasoning (E2W-2), and Grasping (E2W-3). Bottom: The benchmark combines diverse real and simulated data and organizes them into varying complexity levels.
  • Figure 3: CoRL framework. The model is first initialized via supervised fine-tuning (SFT) on Chain-of-Thought annotations, then refined with reinforcement learning (RL). During RL, the policy is optimized with a format reward and the Cross-View Spatial Reward (CVSR), which supplies dense feedback on cross-view fusion and spatial consistency, guiding robust collaborative reasoning.
  • Figure 4: Illustrative demonstrations of our model in real-world robotic manipulation tasks involving pick-and-place with cross-view spatial reasoning.
  • Figure 5: Example of an E2W-1 sample (Case 1). Two ego-centric views of a kitchen scene are provided, and the model must count the total number of pizzas by resolving cross-view object correspondences.
  • ...and 2 more figures