Watch Your STEPP: Semantic Traversability Estimation using Pose Projected Features

Sebastian Ægidius, Dennis Hadjivelichkov, Jianhao Jiao, Jonathan Embley-Riches, Dimitrios Kanoulas

TL;DR

STEPP tackles the challenge of estimating terrain traversability for legged robots in unstructured environments by leveraging dense pose-projected features from a pre-trained Vision Transformer (DINOv2) and an encoder–decoder MLP trained with reconstruction loss. The approach uses both real-world human-walking data and Unreal Engine synthetic data to learn a robust traversability distribution, which is projected into 3D for integration with a local planner. Key contributions include a complete data pipeline with pose projection, a segmentation-and-feature extraction scheme yielding 384-d embeddings, and demonstration on indoor mazes and outdoor forests with the ANYmal platform. The results show favorable accuracy gains when combining diverse training data and highlight STEPP’s potential for robust, anomaly-aware navigation in real time, while also noting limitations related to inference speed and depth accuracy that point to avenues for improvement.

Abstract

Understanding the traversability of terrain is essential for autonomous robot navigation, particularly in unstructured environments such as natural landscapes. Although traditional methods, such as occupancy mapping, provide a basic framework, they often fail to account for the complex mobility capabilities of platforms such as legged robots. In this work, we propose a method for estimating terrain traversability by learning from demonstrations of human walking. Our approach leverages dense, pixel-wise feature embeddings generated using the DINOv2 Vision Transformer model, which are processed through an encoder-decoder MLP architecture to analyze terrain segments. The averaged feature vectors, extracted from the masked regions of interest, are used to train the model in a reconstruction-based framework. By minimizing reconstruction loss, the network learns to distinguish between familiar terrain, which yields a low reconstruction error, and unfamiliar or hazardous terrain, which yields a higher reconstruction error. This approach facilitates the detection of anomalies, allowing a legged robot to navigate more effectively through challenging terrain. We run real-world experiments on the ANYmal legged robot, both indoors and outdoors, to validate our proposed method. The code is open-source, and video demonstrations can be found on our website: https://rpl-cs-ucl.github.io/STEPP
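
The reconstruction-based idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes toy 4-d features in place of the 384-d DINOv2 embeddings, and a one-unit linear autoencoder with tied weights stands in for the encoder-decoder MLP. All names (`masked_mean`, `reconstruction_error`, `train`, the toy vectors) are hypothetical. The sketch averages per-pixel features inside a mask, trains only on "walked-on" features, and then uses the reconstruction error of a segment as its traversability cost.

```python
import random

def masked_mean(features, mask):
    """Average the D-dim feature vectors of all pixels where mask is True."""
    dim = len(features[0][0])
    total, count = [0.0] * dim, 0
    for feat_row, mask_row in zip(features, mask):
        for vec, keep in zip(feat_row, mask_row):
            if keep:
                count += 1
                total = [t + v for t, v in zip(total, vec)]
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reconstruction_error(w, x):
    """Squared error of a 1-unit linear autoencoder with tied weights w."""
    z = dot(w, x)  # encode to a scalar bottleneck
    return sum((z * wi - xi) ** 2 for wi, xi in zip(w, x))  # decode, compare

def train(samples, w, lr=0.02, epochs=200):
    """Plain SGD on the reconstruction loss over walked-on feature vectors."""
    w = list(w)
    for _ in range(epochs):
        for x in samples:
            z = dot(w, x)
            r = [z * wi - xi for wi, xi in zip(w, x)]  # residual x_hat - x
            rw = dot(r, w)
            # Gradient of ||x_hat - x||^2 w.r.t. w: 2 * (z * r + (r . w) * x)
            w = [wi - lr * 2 * (z * ri + rw * xi)
                 for wi, ri, xi in zip(w, r, x)]
    return w

rng = random.Random(0)
familiar = [0.8, 0.6, 0.0, 0.0]  # toy "walked-on terrain" feature (assumption)
novel = [0.0, 0.0, 1.0, 0.0]     # toy feature never seen in training

# Training set: scaled, slightly noisy copies of the familiar feature.
train_set = []
for _ in range(50):
    a = rng.uniform(0.5, 1.5)
    train_set.append([a * f + rng.gauss(0.0, 0.02) for f in familiar])

w = train(train_set, [0.3, 0.2, -0.1, 0.4])

# A 2x2 toy feature map: left column familiar, right column novel.
feature_map = [[familiar, novel], [familiar, novel]]
left_mask = [[True, False], [True, False]]
right_mask = [[False, True], [False, True]]

cost_left = reconstruction_error(w, masked_mean(feature_map, left_mask))
cost_right = reconstruction_error(w, masked_mean(feature_map, right_mask))
print(f"familiar segment cost: {cost_left:.4f}")
print(f"novel segment cost:    {cost_right:.4f}")
```

Because the model only ever sees features from terrain that was walked on, it reconstructs those well and everything else poorly, so no explicit "untraversable" labels are needed. A threshold on the error, chosen here purely for illustration, would turn the cost into a binary traversable/untraversable decision.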

Paper Structure

This paper contains 17 sections, 6 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: An illustration of STEPP with robot and traversability costs overlaid. Blue: traversable; red: untraversable.
  • Figure 2: The full pipeline from creating training data to training the model, as well as the pipeline for STEPP at inference.
  • Figure 3: Samples of the data recorded while walking in different environments, with the pose-projected path overlaid.
  • Figure 4: Unreal Engine simulation environment data used for training. The red line shows the simulation camera's trajectory projected onto the ground.
  • Figure 5: STEPP reconstruction loss on unseen forest data. a) sample environment, b) traversability ground truth, c) STEPP traversability cost, segment-wise, d) STEPP traversability cost, pixel-wise. Segments and pixels that are the darkest red are considered not traversable.
  • ...and 2 more figures