Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers

Ruihan Yang, Minghao Zhang, Nicklas Hansen, Huazhe Xu, Xiaolong Wang

TL;DR

This work addresses robust quadrupedal locomotion by integrating proprioceptive signals with visual depth data through a cross-modal Transformer (LocoTransformer) trained end-to-end with PPO. The model uses separate encoders for each modality and a Transformer encoder to fuse features via self-attention, enabling both short-term reactions and long-term planning. Across diverse simulated and real-world tasks, it shows superior performance, better generalization to unseen terrains and obstacles, and successful sim-to-real transfer compared to baselines that rely on proprioception or simple fusion. The results highlight the value of attention-based fusion for multi-modal RL in complex locomotion settings and suggest broad applicability to real-world legged robots.

Abstract

We propose to address quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs. While learning-based locomotion has made great advances using RL, most methods still rely on domain randomization for training blind agents that generalize to challenging terrains. Our key insight is that proprioceptive states only offer contact measurements for immediate reaction, whereas an agent equipped with visual sensory observations can learn to proactively maneuver environments with obstacles and uneven terrain by anticipating changes in the environment many steps ahead. In this paper, we introduce LocoTransformer, an end-to-end RL method that leverages both proprioceptive states and visual observations for locomotion control. We evaluate our method in challenging simulated environments with different obstacles and uneven terrain. We transfer our learned policy from simulation to a real robot by running it indoors and in the wild with unseen obstacles and terrain. Our method not only significantly improves over baselines, but also achieves far better generalization performance, especially when transferred to the real robot. Our project page with videos is at https://rchalyang.github.io/LocoTransformer/ .

Paper Structure

This paper contains 35 sections, 5 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Overview of simulated environments & real robot trajectories. The top row shows the simulated environments. For each sample, the left image is the environment and the right image is the corresponding observation. Agents are tasked to move forward while avoiding black obstacles and collecting red spheres. The following two rows show the deployment of the RL policy to a real robot in an indoor hallway with boxes and a forest with trees. Our robot successfully utilizes the visual information to traverse the complex environments.
  • Figure 2: Network Architecture. We process proprioceptive states with an MLP and depth images with a ConvNet. We take the proprioceptive embedding as a single token, split the spatial visual feature representation into $N \times N$ tokens, and feed all tokens into the Transformer encoder. The output tokens are further processed by the projection head to predict the value or the action distribution.
  • Figure 3: Self-attention from our shared Transformer module. We visualize the self-attention between the proprioceptive token and all visual tokens in the last layer of our Transformer model. We plot the attention weight over raw visual input where warmer color represents larger attention weight.
  • Figure 4: Training and evaluation curves on simulated environments (solid lines and shaded areas show the mean and the std over 5 seeds, respectively). For the environment without spheres (in (a)), our method achieves comparable training performance but much better evaluation performance on unseen environments (in (b)). For the more challenging environments (in (c) and (d)), our method achieves better performance and sample efficiency.
  • Figure 5: Real World Samples. We evaluate our method in real-world scenarios with different obstacles on complex terrain.
  • ...and 5 more figures
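
The token layout described in the Figure 2 caption and the attention map described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a single attention head with random weights, omits the residual connections, layer normalization, feed-forward layers, and projection head of a full Transformer encoder, and replaces the MLP and ConvNet outputs with random stand-in vectors. The names `build_tokens` and `self_attention`, the embedding size `D = 8`, and the grid size `N = 4` are assumptions chosen only for illustration.

```python
import math
import random


def linear(x, w):
    """Multiply a vector x (length d_in) by a matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols, scale=0.1):
    """Random weight matrix; a stand-in for learned parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def build_tokens(proprio_embedding, visual_feature_map):
    """Return one proprioceptive token followed by N*N visual tokens.

    visual_feature_map is an N x N grid whose cells are D-dimensional vectors.
    """
    tokens = [proprio_embedding]
    for row in visual_feature_map:
        for cell in row:
            tokens.append(cell)
    return tokens


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over all tokens."""
    q = [linear(t, wq) for t in tokens]
    k = [linear(t, wk) for t in tokens]
    v = [linear(t, wv) for t in tokens]
    d = len(q[0])
    outputs, weights = [], []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * v[j][c] for j in range(len(v))) for c in range(d)])
    return outputs, weights


rng = random.Random(0)
D, N = 8, 4  # assumed sizes, not taken from the paper

# Stand-ins for the MLP output and the ConvNet feature map (assumptions).
proprio_embedding = [rng.uniform(-1, 1) for _ in range(D)]
visual_feature_map = [
    [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(N)] for _ in range(N)
]

tokens = build_tokens(proprio_embedding, visual_feature_map)
wq = rand_matrix(rng, D, D)
wk = rand_matrix(rng, D, D)
wv = rand_matrix(rng, D, D)
outputs, weights = self_attention(tokens, wq, wk, wv)

# Attention from the proprioceptive token (index 0) to each visual token,
# reshaped to the N x N grid so it could be overlaid on the visual input.
proprio_to_visual = weights[0][1:]
attention_grid = [proprio_to_visual[r * N:(r + 1) * N] for r in range(N)]
```

Here `attention_grid` plays the role of the heat map in Figure 3: each entry is the weight the proprioceptive token places on one spatial region of the visual features. With untrained random weights the values carry no meaning; only the shapes and data flow are illustrated.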