V-STRONG: Visual Self-Supervised Traversability Learning for Off-road Navigation

Sanghun Jung, JoonHo Lee, Xiangyun Meng, Byron Boots, Alexander Lambert

TL;DR

A novel, image-based self-supervised learning method for traversability prediction, leveraging a state-of-the-art vision foundation model for improved out-of-distribution performance and demonstrating unprecedented performance for generalization to new environments.

Abstract

Reliable estimation of terrain traversability is critical for the successful deployment of autonomous systems in wild, outdoor environments. Given the lack of large-scale annotated datasets for off-road navigation, strictly supervised learning approaches remain limited in their generalization ability. To this end, we introduce a novel, image-based self-supervised learning method for traversability prediction, leveraging a state-of-the-art vision foundation model for improved out-of-distribution performance. Our method employs contrastive representation learning using both human driving data and instance-based segmentation masks during training. We show that this simple yet effective technique drastically outperforms recent methods in predicting traversability for both on- and off-trail driving scenarios. We compare our method with recent baselines on both a common benchmark and our own datasets, covering a diverse range of outdoor environments and varied terrain types. We also demonstrate the compatibility of the resulting costmap predictions with a model-predictive controller. Finally, we evaluate our approach on zero- and few-shot tasks, demonstrating unprecedented performance for generalization to new environments. Videos and additional material can be found here: https://sites.google.com/view/visual-traversability-learning.

Paper Structure (19 sections, 5 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: (Top) The Polaris RZR off-road vehicle used for our data collection. (Bottom-left) The vehicle is equipped with multiple RGB-D cameras. (Bottom-middle) Segmentation masks from SAM can disambiguate terrain features. (Bottom-right) Self-supervised learning with mask proposals can achieve robust, fine-grained traversability prediction.
  • Figure 2: Illustration of our occlusion handling and trajectory-/mask-based sampling examples. From a given image and trajectory, we use stereo depth to filter out occluded poses and project them into the image space (b). Afterward, we generate random positive samples (bright-green pixels in (c)) and negative samples (red pixels in (c)) within and outside the projected trajectory. Using positive samples from trajectory as query points, we obtain a mask prediction from SAM to cover the whole traversable region. Then, we randomly sample positive and negative points using the mask (d).
  • Figure 3: Overview of our method. We first incorporate stereo-depth information to filter out occluded trajectory points and then project the trajectory into image space. Then, positive and negative points are sampled based on the trajectory and SAM-predicted mask information. We apply a pre-trained image encoder along with the traversability decoder that outputs traversability features. Afterward, we extract positive and negative features and apply the trajectory-/mask-based contrastive losses to train the decoder. Additionally, we update our traversability vector using a running-average over positive features. This updated vector is used to calculate the similarity at test time, which will be directly translated into traversability costs. Note that a dashed gray arrow denotes gradient stop, and we do not update our encoder during training.
  • Figure 4: Qualitative results of Seo et al. [daejeon2023], Schmid et al. [jpl2022], and our method on RELLIS-3D and LT Murray datasets. We strongly encourage readers to view the supplementary video for more detailed qualitative results.
  • Figure 5: Qualitative results of Seo et al. [daejeon2023], Schmid et al. [jpl2022], and our method on CA Hills and Mojave Desert validation sequences. We strongly encourage readers to view the supplementary video for more detailed qualitative results.
  • ...and 4 more figures
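
The captions of Figures 2 and 3 describe three steps that a short sketch may help make concrete: sampling positive and negative points from a mask, keeping a running average of positive features as a traversability vector, and turning similarity to that vector into a cost. The following is only an illustrative sketch under stated assumptions, not the authors' implementation. The function names, the exponential-moving-average form of the running average, the `momentum` value, and the linear mapping from cosine similarity to a cost in [0, 1] are all hypothetical choices; the captions do not specify them.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def sample_points(mask, n_pos, n_neg, rng):
    """Sample pixel coordinates inside (positive) and outside (negative) a binary mask.

    `mask` is a 2D list of 0/1 values standing in for a projected trajectory
    or a SAM-predicted mask (assumption: any binary mask works here).
    """
    inside = [(r, c) for r, row in enumerate(mask) for c, m in enumerate(row) if m]
    outside = [(r, c) for r, row in enumerate(mask) for c, m in enumerate(row) if not m]
    positives = rng.sample(inside, min(n_pos, len(inside)))
    negatives = rng.sample(outside, min(n_neg, len(outside)))
    return positives, negatives


def update_traversability_vector(vector, positive_features, momentum=0.99):
    """Blend the mean positive feature into the running traversability vector.

    Assumption: the running average is an exponential moving average with a
    hypothetical `momentum`; the first call simply adopts the batch mean.
    """
    dim = len(positive_features[0])
    count = len(positive_features)
    mean = normalize([sum(f[i] for f in positive_features) / count for i in range(dim)])
    if vector is None:
        return mean
    blended = [momentum * v + (1.0 - momentum) * m for v, m in zip(vector, mean)]
    return normalize(blended)


def traversability_cost(feature, vector):
    """Map cosine similarity in [-1, 1] to a cost in [0, 1] (assumed linear mapping)."""
    cost = (1.0 - cosine_similarity(feature, vector)) / 2.0
    return min(1.0, max(0.0, cost))  # clamp against floating-point drift
```

In this sketch, a pixel feature aligned with the traversability vector receives a cost near 0 and an opposing feature a cost near 1. A real pipeline would apply the same computation densely over decoder features with an array library rather than Python lists.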