Wild Visual Navigation: Fast Traversability Learning via Pre-Trained Models and Online Self-Supervision

Matías Mattamala, Jonas Frey, Piotr Libera, Nived Chebrolu, Georg Martius, Cesar Cadena, Marco Hutter, Maurice Fallon

TL;DR

Wild Visual Navigation (WVN) tackles autonomous outdoor navigation by learning visual traversability online from a brief human demonstration. It combines high-dimensional features from pre-trained self-supervised models (e.g., DINO-ViT, STEGO) with an online supervision generator, so the traversability model is trained concurrently with inference and adapts rapidly in forests, parks, and grasslands. A dual-graph supervision framework and anomaly-informed confidence make the online learning robust, while integration with a local terrain map and reactive planning supports closed-loop autonomous navigation. Real-world deployments demonstrate fast in-field adaptation (less than 5 min of training), better handling of natural terrain than purely geometric methods, and kilometer-scale autonomous path following, highlighting the practical impact for legged robots operating in complex environments.

Abstract

Natural environments such as forests and grasslands are challenging for robotic navigation because high grass, twigs, and bushes are falsely perceived as rigid obstacles. In this work, we present Wild Visual Navigation (WVN), an online self-supervised learning system for visual traversability estimation. The system continuously adapts from a short human demonstration in the field, using only onboard sensing and computing. One of the key ideas is the use of high-dimensional features from pre-trained self-supervised models, which implicitly encode semantic information that massively simplifies the learning task. Further, an online scheme for supervision generation enables concurrent training and inference of the learned model in the wild. We demonstrate our approach through diverse real-world deployments in forests, parks, and grasslands. Our system bootstraps the traversable terrain segmentation in less than 5 min of in-field training time, enabling the robot to navigate in complex, previously unseen outdoor terrains. Code: https://bit.ly/498b0CV - Project page: https://bit.ly/3M6nMHH

Paper Structure (39 sections, 7 equations, 12 figures, 1 table)

Figures (12)

  • Figure 1: WVN learns to predict traversability from images via online self-supervised learning. Starting from a randomly initialized traversability estimation network without prior assumptions about the environment (a), a human operator drives the robot around areas that are traversable for the given platform (b). After a few minutes of operation, WVN learns to distinguish between traversable (blue $\blacksquare$) and untraversable (red $\blacksquare$) areas (c), enabling the robot to navigate autonomously and safely within the environment (d).
  • Figure 2: System overview: WVN only requires monocular RGB images, odometry, and proprioceptive data as input, which are processed to extract features and supervision signals used for online learning and inference of traversability (see Sec. \ref{sec:Method}).
  • Figure 3: Feature Extraction & Inference process: The camera scheduler module (Sec. \ref{subsubsec:multi-camera}) selects one camera from the available pool and provides the RGB image to the feature extractor module (Sec. \ref{subsubsec:feature-extraction}). This module extracts dense visual features $\mathbf{F}$ using pre-trained models. Next, the sub-sample module produces a reduced set of embeddings $\{ \mathbf{f}_{n} \}$ using a subsampling strategy based on a weak segmentation system (Sec. \ref{subsubsec:feature-subsampling}). Lastly, the inference module predicts traversability from the image using the embeddings.
  • Figure 4: Comparison of feature segmentation methods for 3 example images. SLIC over-segments the image but fails to construct semantically coherent segments (e.g., in the top row it merges fence and ground into a single segment). The STEGO segmentation aligns with the semantics, but computing prototype vectors across a full dataset limits the number of semantic classes, so two semantic classes can be merged into a single segment (grass and walkway, bottom row). Our modified version of STEGO over-segments the image but still provides semantically meaningful segments without pre-setting prototype vectors before deployment.
  • Figure 5: Supervision and mission graphs: (a) Information stored in each graph over the mission. While the Supervision Graph only stores temporary information about the robot's footprint in a sliding window, the Mission Graph saves the data required for online learning over the full mission. The color of the footprint patches indicates the generated traversability score. (b) The interaction between graphs updates the traversability in the mission nodes by reprojecting the robot's footprint and traversability scores.
  • ...and 7 more figures
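
The sub-sampling step in the Figure 3 caption, which reduces the dense features $\mathbf{F}$ to a set of embeddings $\{ \mathbf{f}_{n} \}$ guided by a weak segmentation, can be illustrated with a small sketch. The excerpt here does not state how features are pooled inside a segment, so the mean pooling below is an assumption, and `subsample_embeddings` is a hypothetical helper name rather than a function from the WVN code base. The example is kept dependency-free and uses nested lists in place of image tensors.

```python
def subsample_embeddings(features, segments):
    """Pool dense per-pixel features into one embedding per segment.

    features: H x W grid of C-dimensional feature vectors (nested lists).
    segments: H x W grid of integer segment ids from a weak segmentation.
    Returns a dict {segment_id: mean feature vector}.

    Assumption: mean pooling per segment (not confirmed by the source).
    """
    sums = {}    # running per-channel sums for each segment
    counts = {}  # number of pixels seen for each segment
    for feat_row, seg_row in zip(features, segments):
        for feat, seg_id in zip(feat_row, seg_row):
            if seg_id not in sums:
                sums[seg_id] = [0.0] * len(feat)
                counts[seg_id] = 0
            acc = sums[seg_id]
            for channel, value in enumerate(feat):
                acc[channel] += value
            counts[seg_id] += 1
    return {
        seg_id: [total / counts[seg_id] for total in acc]
        for seg_id, acc in sums.items()
    }


# Toy 2x3 "image" with 2-dimensional features and two segments (0 and 1).
features = [
    [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]],
    [[2.0, 0.0], [0.0, 4.0], [0.0, 6.0]],
]
segments = [
    [0, 0, 1],
    [0, 1, 1],
]
embeddings = subsample_embeddings(features, segments)
```

In this toy case the six pixel features collapse into two embeddings, one per segment, which is the kind of reduction that keeps per-image training and inference cheap. In a real pipeline the same pooling would more likely be done on tensors with a scatter or segment-mean operation.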