Beyond Egocentric Limits: Multi-View Depth-Based Learning for Robust Quadrupedal Locomotion

Rémy Rahem, Wael Suleiman

TL;DR

This work tackles the fragility of egocentric perception in dynamic quadrupedal locomotion by introducing a multi-view depth-based framework that fuses onboard and remote depth streams through a teacher-student distillation pipeline. A teacher policy is trained with privileged information and then distilled into a robust dual-depth student that handles remote-view dropout (RD) and misalignment thanks to extensive domain randomization. Results show that multi-view policies outperform single-view baselines on challenging tasks and remain stable when exocentric inputs are partially unavailable, with exposure to RD during training proving crucial for resilience. The approach supports aerial-ground cooperative sensing and aims to ease sim-to-real transfer, offering a practical path toward perception-rich, robust legged locomotion.

Abstract

Recent progress in legged locomotion has enabled highly dynamic, parkour-like behaviors for robots, similar to those of their biological counterparts. Yet these methods mostly rely on egocentric (first-person) perception, which limits their performance, especially when the robot's viewpoint is occluded. A promising solution is to enhance the robot's environmental awareness with complementary viewpoints, such as multiple actors exchanging perceptual information. Inspired by this idea, this work proposes a multi-view depth-based locomotion framework that combines egocentric and exocentric observations to provide richer environmental context during agile locomotion. Using a teacher-student distillation approach, the student policy learns to fuse proprioception with dual depth streams while remaining robust to real-world sensing imperfections. To further improve robustness, we introduce extensive domain randomization, including stochastic remote-camera dropouts and 3D positional perturbations that emulate aerial-ground cooperative sensing. Simulation results show that multi-viewpoint policies outperform the single-viewpoint baseline in gap crossing, step descent, and other dynamic maneuvers, while maintaining stability when the exocentric camera is partially or completely unavailable. Additional experiments show that moderate viewpoint misalignment is well tolerated when incorporated during training. This study demonstrates that heterogeneous visual feedback improves robustness and agility in quadrupedal locomotion. Furthermore, to support reproducibility, the implementation accompanying this work is publicly available at https://anonymous.4open.science/r/multiview-parkour-6FB8
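
The stochastic remote-camera dropout mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `apply_remote_dropout`, the zero-fill convention for a missing frame, and the per-frame dropout probability `drop_prob` are all assumptions made for illustration.

```python
import numpy as np

def apply_remote_dropout(remote_depth, drop_prob, rng):
    """Blank the remote depth frame with probability drop_prob.

    Hypothetical helper: zero-filling a dropped frame is an assumption,
    not something the paper summary specifies.
    Returns the (possibly blanked) frame and a flag telling whether it was dropped.
    """
    dropped = bool(rng.random() < drop_prob)
    if dropped:
        return np.zeros_like(remote_depth), True
    return remote_depth, False

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = np.ones((4, 6), dtype=np.float32)  # placeholder depth image
    out, dropped = apply_remote_dropout(frame, drop_prob=0.3, rng=rng)
    print("dropped:", dropped, "mean depth:", float(out.mean()))
```

Applying such a mask during training exposes the policy to episodes in which only the egocentric stream carries information, which is the condition the dropout experiments are meant to probe.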

Paper Structure

This paper contains 20 sections, 8 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The two-phase training: In phase 1, the teacher (top) uses proprioception and privileged information (blue) to train an extrinsic estimator (EE) and an actor to produce the desired actions, i.e. joint position targets. In phase 2, the student (bottom) improves the actor by imitating the teacher using onboard and remote depth streams, proprioception, and the pretrained EE.
  • Figure 2: Illustration of the positional randomization of the remote camera within a spherical region of radius $R_s^t$.
  • Figure 3: Mean training reward across all models. A moving average with a window size of 100 is applied for clarity.
  • Figure 4: Comparison of the baseline model (a) and the combined-vision model trained without RD (b) when executing a locomotion sequence to jump over a gap, using the perception data shown in (c). The baseline model (a) underestimates the gap distance, resulting in an insufficient jump. In contrast, the combined-vision model (b) accurately estimates the required distance and successfully performs the jump.
  • Figure 5: Comparison between the combined-vision model trained without (a) and with RD (b) when executing a locomotion sequence to descend a step while experiencing RD. In the model trained without RD (a), the robot over-relies on the exocentric view, leading to catastrophic failure when RD occurs. In contrast, the model trained with RD (b) learns to integrate both viewpoints robustly, enabling safe and stable descent even under RD conditions.
  • ...and 1 more figures
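
The positional randomization illustrated in Figure 2 places the remote camera somewhere inside a sphere of radius $R_s^t$ around its nominal position. A minimal sketch of one way to sample such a perturbation follows. It assumes the offset is drawn uniformly over the sphere's volume, which the caption does not state, and the names `sample_offset_in_sphere` and `perturb_camera_position` are hypothetical.

```python
import numpy as np

def sample_offset_in_sphere(radius, rng):
    """Draw a 3D offset uniformly from a ball of the given radius (assumed distribution)."""
    # A normalized Gaussian vector gives a direction uniform on the unit sphere.
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    # The cube root of a uniform variate makes the radius uniform in volume.
    r = radius * rng.random() ** (1.0 / 3.0)
    return r * direction

def perturb_camera_position(nominal_position, radius, rng):
    """Shift the nominal remote-camera position by a random offset within the sphere."""
    nominal = np.asarray(nominal_position, dtype=float)
    return nominal + sample_offset_in_sphere(radius, rng)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    nominal = [1.0, 0.0, 2.0]  # placeholder camera position in metres
    perturbed = perturb_camera_position(nominal, radius=0.5, rng=rng)
    print("offset norm:", float(np.linalg.norm(perturbed - np.array(nominal))))
```

Sampling the direction and the radius separately keeps every draw inside the sphere without rejection, so the radius can be changed between draws at no extra cost.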