Table of Contents
Fetching ...

World Model-based Perception for Visual Legged Locomotion

Hang Lai, Jiahang Cao, Jiafeng Xu, Hongtao Wu, Yunfeng Lin, Tao Kong, Yong Yu, Weinan Zhang

TL;DR

A simple yet effective method, World Model-based Perception (WMP), which builds a world model of the environment and learns a policy based on the world model, illustrating that though completely trained in simulation, the world model can make accurate predictions of real-world trajectories, thus providing informative signals for the policy controller.

Abstract

Legged locomotion over various terrains is challenging and requires precise perception of the robot and its surroundings from both proprioception and vision. However, learning directly from high-dimensional visual input is often data-inefficient and intricate. To address this issue, traditional methods attempt to learn a teacher policy with access to privileged information first and then learn a student policy to imitate the teacher's behavior with visual input. Despite some progress, this imitation framework prevents the student policy from achieving optimal performance due to the information gap between inputs. Furthermore, the learning process is unnatural since animals intuitively learn to traverse different terrains based on their understanding of the world without privileged knowledge. Inspired by this natural ability, we propose a simple yet effective method, World Model-based Perception (WMP), which builds a world model of the environment and learns a policy based on the world model. We illustrate that though completely trained in simulation, the world model can make accurate predictions of real-world trajectories, thus providing informative signals for the policy controller. Extensive simulated and real-world experiments demonstrate that WMP outperforms state-of-the-art baselines in traversability and robustness. Videos and Code are available at: https://wmp-loco.github.io/.

World Model-based Perception for Visual Legged Locomotion

TL;DR

A simple yet effective method, World Model-based Perception (WMP), which builds a world model of the environment and learns a policy based on the world model, illustrating that though completely trained in simulation, the world model can make accurate predictions of real-world trajectories, thus providing informative signals for the policy controller.

Abstract

Legged locomotion over various terrains is challenging and requires precise perception of the robot and its surroundings from both proprioception and vision. However, learning directly from high-dimensional visual input is often data-inefficient and intricate. To address this issue, traditional methods attempt to learn a teacher policy with access to privileged information first and then learn a student policy to imitate the teacher's behavior with visual input. Despite some progress, this imitation framework prevents the student policy from achieving optimal performance due to the information gap between inputs. Furthermore, the learning process is unnatural since animals intuitively learn to traverse different terrains based on their understanding of the world without privileged knowledge. Inspired by this natural ability, we propose a simple yet effective method, World Model-based Perception (WMP), which builds a world model of the environment and learns a policy based on the world model. We illustrate that though completely trained in simulation, the world model can make accurate predictions of real-world trajectories, thus providing informative signals for the policy controller. Extensive simulated and real-world experiments demonstrate that WMP outperforms state-of-the-art baselines in traversability and robustness. Videos and Code are available at: https://wmp-loco.github.io/.
Paper Structure (12 sections, 10 equations, 7 figures, 1 table)

This paper contains 12 sections, 10 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Scandots (Left) and the corresponding depth images (Right). Top: Sparse scandotscan not distinguish the precise distance to the boundaries (indicated in green), leading to a collision with the left barrier. Bottom: Scandots can not represent off-ground obstacles. In contrast, depth images can represent these terrains well.
  • Figure 2: Illustration of the WMP framework. The world model runs at a lower frequency than policy, with an update interval of $k$ timesteps.
  • Figure 3: T-SNE result of recurrent state $h_t$ over six different terrains.
  • Figure 4: Average return over different world model interval $k$ (Left) and different lengths of training data (Right).
  • Figure 5: Real-world depth images and long-term predictions of depth images using the world model.
  • ...and 2 more figures