Legged Locomotion in Challenging Terrains using Egocentric Vision

Ananye Agarwal, Ashish Kumar, Jitendra Malik, Deepak Pathak

TL;DR

This work addresses robust legged locomotion on challenging terrains from egocentric depth vision by learning an end-to-end policy without explicit elevation maps. Training has two phases: phase 1 uses reinforcement learning with scandots, a cheap-to-compute stand-in for depth, to train a policy $\pi^1$, and phase 2 distills $\pi^1$ via supervised learning into a depth-based deployment policy $\pi^2$. Two architectures are considered for memory and sensor integration: a monolithic GRU and a decoupled design trained with RMA. The resulting policy runs in real time on a small quadruped and traverses stairs, curbs, gaps, stepping stones, and natural environments, with strong sim-to-real transfer and resilience to perturbations. The work also provides a theoretical guarantee (Theorem 2.1) that bounds phase-2 performance given phase-1 optimality and phase-2 closeness. Overall, the results show that end-to-end egocentric depth control can match or exceed map-based foothold strategies in both simulation and real-world trials, enabling small, low-cost robots to navigate complex terrains without heavy perception pipelines. The contribution removes the reliance on elevation maps and lets contact-robust gaits emerge from the learned policy rather than from hand-designed foothold planning.

Abstract

Animals are capable of precise and agile locomotion using vision. Replicating this ability has been a long-standing goal in robotics. The traditional approach has been to decompose this problem into elevation mapping and foothold planning phases. The elevation mapping, however, is susceptible to failure and large noise artifacts, requires specialized hardware, and is biologically implausible. In this paper, we present the first end-to-end locomotion system capable of traversing stairs, curbs, stepping stones, and gaps. We show this result on a medium-sized quadruped robot using a single front-facing depth camera. The small size of the robot necessitates discovering specialized gait patterns not seen elsewhere. The egocentric camera requires the policy to remember past information to estimate the terrain under its hind feet. We train our policy in simulation. Training has two phases: in phase 1, we train a policy using reinforcement learning with a cheap-to-compute variant of the depth image, and in phase 2, we distill it via supervised learning into the final policy that uses depth. The resulting policy transfers to the real world and is able to run in real time on the limited compute of the robot. It can traverse a large variety of terrain while being robust to perturbations like pushes, slippery surfaces, and rocky terrain. Videos are at https://vision-locomotion.github.io
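
The phase-2 distillation step described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: the linear "teacher" and "student", the three-dimensional inputs, the noise level, and all function names are hypothetical stand-ins chosen so the example runs with the standard library only. It shows the general idea of supervised imitation: a teacher that sees privileged terrain information labels each sample with an action, and a student that sees only a noisier observation is regressed onto those actions.

```python
import random

# Hypothetical phase-1 "teacher": a fixed linear map from privileged
# terrain features (standing in for scandots) to a scalar action.
TEACHER_WEIGHTS = [0.5, -0.25, 0.75]

def teacher_action(scandots):
    return sum(w * s for w, s in zip(TEACHER_WEIGHTS, scandots))

def student_action(params, depth_features):
    # Hypothetical phase-2 "student": a linear map over depth-derived features.
    return sum(p * d for p, d in zip(params, depth_features))

def distill(num_steps=2000, lr=0.05, seed=0):
    """Regress the student onto teacher actions with plain SGD."""
    rng = random.Random(seed)
    params = [0.0, 0.0, 0.0]
    losses = []
    for _ in range(num_steps):
        # Privileged observation, available to the teacher only.
        scandots = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        # Assumed student observation: a slightly noisy view of the same terrain.
        depth = [s + rng.gauss(0.0, 0.01) for s in scandots]
        target = teacher_action(scandots)
        error = student_action(params, depth) - target
        losses.append(error * error)
        # Gradient of the squared error with respect to each parameter.
        params = [p - lr * 2.0 * error * d for p, d in zip(params, depth)]
    return params, losses
```

Calling `distill()` returns the learned student parameters and the per-step squared imitation error; in this toy setting the parameters approach `TEACHER_WEIGHTS`. A real pipeline would replace the linear maps with neural networks and the sampled features with rendered depth images.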

Paper Structure (32 sections, 2 theorems, 14 equations, 7 figures, 2 tables, 1 algorithm)

This paper contains 32 sections, 2 theorems, 14 equations, 7 figures, 2 tables, 1 algorithm.

Key Result

Theorem 2.1

Let $\mathcal{M} = \left(\mathcal{S}, \mathcal{A}, P, R, \gamma\right)$ be an MDP with state space $\mathcal{S}$, action space $\mathcal{A}$, transition function $P:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{S}$, reward function $R:\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$ and discount factor $\gamma$ ...
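
The theorem statement is truncated here, so the sketch below illustrates only the notation, not the bound itself. It shows how a discount factor $\gamma$ turns a reward sequence into the scalar return by which policies in an MDP are usually compared. The function name and the reward values are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last step back."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Hypothetical per-step rewards for a teacher and a student rollout.
teacher_rewards = [1.0, 1.0, 1.0]
student_rewards = [1.0, 0.5, 1.0]
gap = discounted_return(teacher_rewards, 0.5) - discounted_return(student_rewards, 0.5)
```

Here `gap` is the difference in discounted return between the two rollouts, which is the kind of quantity a performance bound on the phase-2 policy would control.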

Figures (7)

  • Figure 1: Our robot can traverse a variety of challenging terrain in indoor and outdoor environments, urban and natural settings during day and night using a single front-facing depth camera. The robot can traverse curbs, stairs and moderately rocky terrain. Despite being much smaller than other commonly used legged robots, it is able to climb stairs and curbs of a similar height. Videos at https://vision-locomotion.github.io
  • Figure 2: A smaller robot (a) faces challenges in climbing stairs and curbs due to the stair obstructing its feet while going up and a tendency to topple over when coming down (b). Our robot deals with this by climbing using a large hip abduction that automatically emerges during training (c).
  • Figure 3: We train our locomotion policy in two phases to avoid rendering depth for too many samples. In phase 1, we use RL to train a policy $\pi^1$ that has access to scandots, which are cheap to compute. In phase 2, we use $\pi^1$ to provide ground-truth actions, which another policy $\pi^2$ is trained to imitate. This student has access to the depth map from the front camera. We consider two architectures: (1) a monolithic one, a GRU trained to output joint angles with raw observations as input, and (2) a decoupled one, trained using RMA to estimate vision and proprioception latents that condition a base feedforward walking policy.
  • Figure 4: We show success rates and time-to-failure (TTF) for our method and the blind baseline on curbs, stairs, stepping stones and gaps. We use one policy for stairs, distilled to the front camera, and a separate policy trained on stepping stones, distilled to the top camera, for gaps and stepping stones. We observe that our method solves all the tasks perfectly except for the stepping stone task, in which the robot achieves 94% success. The blind baseline fails completely on gaps and stepping stones. For upstairs, it makes some progress but fails to complete the entire staircase even once, which is expected given the small size of the robot. The blind policy completes the downstairs task with 100% success, although it learns a very high-impact falling gait to solve the task. In our experiments, the robot dislocates its rear right leg during the blind downstairs trials.
  • Figure 5: Set of terrain we use during training
  • ...and 2 more figures
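
The monolithic architecture in Figure 3 uses a GRU, and the abstract notes that the policy must remember past observations to estimate the terrain under the hind feet. The sketch below is a scalar GRU cell with hand-picked, hypothetical weights (not learned, and not the paper's network). It uses the standard GRU gating equations in one of the two common conventions for the update gate, and shows that an input seen at one step still influences the hidden state several steps later.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, p):
    """One scalar GRU update; p holds hypothetical weights and biases."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])   # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])   # reset gate
    h_candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])
    # Blend the old state with the candidate state.
    return (1.0 - z) * h + z * h_candidate

def run(inputs, p):
    """Feed a sequence through the cell and return every hidden state."""
    h = 0.0
    states = []
    for x in inputs:
        h = gru_step(x, h, p)
        states.append(h)
    return states

# Hand-picked illustrative parameters, not trained values.
PARAMS = {"wz": 1.0, "uz": 0.0, "bz": -2.0,
          "wr": 0.0, "ur": 0.0, "br": 0.0,
          "wh": 2.0, "uh": 1.0, "bh": 0.0}

# An obstacle is "seen" only at the first step, then leaves the field of view.
seen_once = run([1.0, 0.0, 0.0, 0.0, 0.0, 0.0], PARAMS)
never_seen = run([0.0] * 6, PARAMS)
```

With these parameters the final entry of `seen_once` remains positive while `never_seen` stays at zero, so the hidden state carries information about an observation that is no longer visible.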

Theorems & Definitions (3)

  • Theorem 2.1
  • Theorem
  • proof