EgoPush: Learning End-to-End Egocentric Multi-Object Rearrangement for Mobile Robots

Boyuan An, Zhexiong Wang, Yipeng Wang, Jiaqi Li, Sihang Li, Jing Zhang, Chen Feng

TL;DR

EgoPush addresses long-horizon, multi-object non-prehensile rearrangement from a single egocentric camera by learning an object-centric latent representation of relative spatial relations and distilling a privileged teacher into a purely visual student. The RL teacher operates on sparse keypoints under egocentric visibility limits, so the behaviors it learns remain recoverable from the student's viewpoint. The student uses RGB only for instance grouping, receives group-wise depth inputs, and inherits the teacher's spatial reasoning through relational distillation, without global state estimation. Stage-local, temporally decayed completion rewards improve credit assignment across subgoals, and relational losses align the teacher and student latent relations to preserve task intent. In simulation, EgoPush outperforms end-to-end RL baselines in success rate; it also transfers zero-shot to a real mobile robot and is robust to occlusion and perception gaps, highlighting the value of structured supervision and cross-modal distillation for egocentric mobile manipulation.

Abstract

Humans can rearrange objects in cluttered environments using egocentric perception, navigating occlusions without global coordinates. Inspired by this capability, we study long-horizon multi-object non-prehensile rearrangement for mobile robots using a single egocentric camera. We introduce EgoPush, a policy learning framework that enables egocentric, perception-driven rearrangement without relying on explicit global state estimation that often fails in dynamic scenes. EgoPush designs an object-centric latent space to encode relative spatial relations among objects, rather than absolute poses. This design enables a privileged reinforcement-learning (RL) teacher to jointly learn latent states and mobile actions from sparse keypoints, which is then distilled into a purely visual student policy. To reduce the supervision gap between the omniscient teacher and the partially observed student, we restrict the teacher's observations to visually accessible cues. This induces active perception behaviors that are recoverable from the student's viewpoint. To address long-horizon credit assignment, we decompose rearrangement into stage-level subproblems using temporally decayed, stage-local completion rewards. Extensive simulation experiments demonstrate that EgoPush significantly outperforms end-to-end RL baselines in success rate, with ablation studies validating each design choice. We further demonstrate zero-shot sim-to-real transfer on a mobile platform in the real world. Code and videos are available at https://ai4ce.github.io/EgoPush/.
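
The stage-local, temporally decayed completion reward mentioned in the abstract can be illustrated with a minimal sketch. The exact reward used by EgoPush is not given here, so the exponential decay form, the names `StageRewardConfig`, `base_bonus`, `decay`, and `rollout_rewards`, and all numeric values are assumptions chosen for illustration only. The point it shows is that the decay clock restarts at each stage, so a late stage is not penalized for time spent in earlier ones.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class StageRewardConfig:
    # Hypothetical parameters, not taken from the paper.
    base_bonus: float = 1.0  # bonus paid when a stage is completed
    decay: float = 0.99      # per-step decay factor, in (0, 1]


def stage_completion_reward(steps_in_stage: int, cfg: StageRewardConfig) -> float:
    """Completion bonus decayed by the steps spent inside the current stage only."""
    if steps_in_stage < 0:
        raise ValueError("steps_in_stage must be non-negative")
    return cfg.base_bonus * cfg.decay ** steps_in_stage


def rollout_rewards(
    completion_steps: Sequence[int], horizon: int, cfg: StageRewardConfig
) -> List[float]:
    """Per-step rewards for one episode.

    completion_steps holds the increasing global time steps at which
    successive stages finish. The decay clock is reset after each stage.
    """
    rewards = [0.0] * horizon
    stage_start = 0
    for t_done in completion_steps:
        rewards[t_done] = stage_completion_reward(t_done - stage_start, cfg)
        stage_start = t_done + 1  # the next stage starts with a fresh clock
    return rewards


if __name__ == "__main__":
    cfg = StageRewardConfig(base_bonus=1.0, decay=0.5)
    print(rollout_rewards([2, 5], horizon=8, cfg=cfg))
```

In this toy rollout both stages take two steps before completing, so both receive the same bonus even though the second one finishes later in the episode. A single episode-wide decay would instead shrink the second bonus.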

Paper Structure

This paper contains 35 sections, 24 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: EgoPush Overview. EgoPush is a two-phase learning framework for long-horizon, multi-object non-prehensile rearrangement under egocentric observations: in Phase 1, a privileged teacher policy is trained from sparse keypoints while enforcing egocentric, visibility-limited sensing so its behaviors remain visually recoverable; in Phase 2, an egocentric student uses RGB only for instance grouping and receives group-wise depth inputs, and is distilled online from the teacher via latent and action regression, enabling zero-shot sim-to-real deployment on a TurtleBot with a RealSense camera.
  • Figure 2: Comparative analysis of teacher observation spaces.
  • Figure 3: Training curves for the credit assignment ablations.
  • Figure 4: Training curves for the distillation ablation.
  • Figure 5: Real-world hardware setup.
  • ...and 7 more figures
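
The Figure 1 caption describes distilling the student from the teacher via latent and action regression. A minimal sketch of such an objective follows. The use of mean squared error, the function names `mse` and `distillation_loss`, and the weights `latent_weight` and `action_weight` are assumptions for illustration; the caption does not specify the loss form or its weighting.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def distillation_loss(
    student_latent: Sequence[float],
    teacher_latent: Sequence[float],
    student_action: Sequence[float],
    teacher_action: Sequence[float],
    latent_weight: float = 1.0,  # hypothetical weighting
    action_weight: float = 1.0,  # hypothetical weighting
) -> float:
    """Weighted sum of a latent regression term and an action regression term."""
    latent_term = mse(student_latent, teacher_latent)
    action_term = mse(student_action, teacher_action)
    return latent_weight * latent_term + action_weight * action_term


if __name__ == "__main__":
    loss = distillation_loss(
        student_latent=[0.0, 0.0],
        teacher_latent=[1.0, 1.0],
        student_action=[0.5],
        teacher_action=[0.0],
        action_weight=2.0,
    )
    print(loss)
```

The latent term pulls the student's visual encoding toward the teacher's keypoint-based latent, and the action term pulls its commands toward the teacher's. In practice the two vectors would come from the student and teacher networks evaluated on the same state.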