MOVE: Multi-skill Omnidirectional Legged Locomotion with Limited View in 3D Environments

Songbo Li, Shixin Luo, Jun Wu, Qiuguo Zhu

TL;DR

The paper addresses omnidirectional legged locomotion for low-cost robots with limited egocentric vision. It introduces MOVE, a one-stage end-to-end framework built on PS-Net, a pseudo-siamese network that combines reconstruction and contrastive learning to infer unseen surroundings from a cube-map privileged representation, enabling robust motion across 3D terrains. The architecture comprises a standard input encoder, a surroundings encoder, a policy network, and a value network, with an asymmetric attention mechanism and a mixed supervision objective. Experiments in simulation and on a real Lite3 robot demonstrate strong performance across forward and omnidirectional tasks (jumps, climbs, crawls), even under depth noise and partial occlusions, highlighting sim-to-real transfer. This work broadens the operational scope of egocentric-vision quadrupeds and provides a practical path toward real-time omnidirectional locomotion in challenging 3D environments.

Abstract

Legged robots possess inherent advantages in traversing complex 3D terrains. However, previous work on low-cost quadruped robots with egocentric vision systems has been limited by a narrow front-facing view and exteroceptive noise, restricting omnidirectional mobility in such environments. While building a voxel map through a hierarchical structure can refine exteroception processing, it introduces significant computational overhead, noise, and delays. In this paper, we present MOVE, a one-stage end-to-end learning framework capable of multi-skill omnidirectional legged locomotion with limited view in 3D environments, much as a real animal can do. When movement aligns with the robot's line of sight, exteroceptive perception enhances locomotion, enabling extreme climbing and leaping. When vision is obstructed or the direction of movement lies outside the robot's field of view, the robot relies on proprioception for tasks like crawling and climbing stairs. We integrate all these skills into a single neural network by introducing a pseudo-siamese network structure that combines supervised and contrastive learning, which helps the robot infer its surroundings beyond its field of view. Experiments in both simulation and real-world scenarios demonstrate the robustness of our method, broadening the operational environments for robots with egocentric vision.

Paper Structure

This paper contains 19 sections, 7 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: We deploy our policy in real-world environments, demonstrating the diverse and exceptional motion skills of the proposed framework MOVE. Our robot uses limited visual perception to traverse complex 3D environments omnidirectionally. Even when exteroception is severely disrupted or unavailable, the vision-dependent policy still enables the robot to overcome obstacles.
  • Figure 2: Overview of the proposed MOVE framework. We use a one-stage learning pipeline to train a comprehensive locomotion policy with access to limited visual perception. The left side of the figure illustrates the composition of the two types of visual input. The blue box represents PS-Net, which is trained by a combination of supervised and unsupervised learning methods.
  • Figure 3: The detailed architecture of PS-Net, consisting of two main parts: a standard input encoder and a surroundings encoder. Using a pseudo-siamese network structure, PS-Net is able to extract shared features between the standard input and the privileged information, even from incomplete and noisy observations.
  • Figure 4: Asymmetric attention mechanism in PS-Net. PS-Net incorporates an asymmetric attention mechanism to effectively process different input modalities. The standard input encoder employs a self-attention mechanism to fuse multimodal information from proprioception and depth images. Conversely, the surroundings encoder uses a cross-attention module to focus on privileged visual inputs, helping the standard input encoder extract surrounding visual features through contrastive learning.
  • Figure 5: The average per-channel standard deviation (std) of the ${\ell}_2$-normalized $\hat{\mathbf{z}}_t^c$. If $\hat{\mathbf{z}}_t^c$ follows a zero-mean isotropic Gaussian distribution, the std of the ${\ell}_2$-normalized $\hat{\mathbf{z}}_t^c$ is expected to be approximately $1/\sqrt{d}$, where $d$ is the dimension of $\hat{\mathbf{z}}_t^c$ along the channel axis. In our case, $d = 16$. A noticeable drop of the std below this value suggests a degree of representation collapse.
  • ...and 4 more figures
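
The $1/\sqrt{d}$ reference value in the Figure 5 caption can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the function name `avg_channel_std`, the sample count, and the synthetic "collapsed" embeddings are assumptions made for illustration only.

```python
import numpy as np


def avg_channel_std(z: np.ndarray) -> float:
    """Average per-channel std of l2-normalized embeddings.

    z has shape (num_samples, d); each row is one embedding vector.
    """
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    z_hat = z / np.maximum(norms, 1e-12)  # guard against division by zero
    return float(z_hat.std(axis=0).mean())


rng = np.random.default_rng(0)
d = 16            # channel dimension quoted in the Figure 5 caption
n = 10_000        # hypothetical number of samples

# Zero-mean isotropic Gaussian embeddings: the "healthy" reference case.
healthy = rng.standard_normal((n, d))

# Hypothetical collapsed embeddings: every sample points in nearly the
# same direction, with only a small perturbation.
collapsed = np.ones((n, d)) + 0.01 * rng.standard_normal((n, d))

reference = 1.0 / np.sqrt(d)  # 0.25 for d = 16
healthy_std = avg_channel_std(healthy)
collapsed_std = avg_channel_std(collapsed)

print(f"reference 1/sqrt(d): {reference:.4f}")
print(f"isotropic Gaussian:  {healthy_std:.4f}")
print(f"collapsed example:   {collapsed_std:.4f}")
```

In this sketch the isotropic case lands close to 0.25, while the collapsed case falls far below it, which is the kind of gap the caption treats as a sign of representation collapse.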