VANP: Learning Where to See for Navigation with Self-Supervised Vision-Action Pre-Training

Mohammad Nazeri, Junzhe Wang, Amirreza Payandeh, Xuesu Xiao

TL;DR

VANP is a self-supervised Vision-Action Pre-Training framework that learns navigation-relevant visual features from RGB sequences without relying on downstream supervision. It encodes a visual history, future actions, and a visual goal with two Transformer Encoders, and trains with a VICReg-based mutual information objective that emphasizes navigation-relevant regions without requiring negative samples. Empirical results show that VANP matches the performance of models trained end-to-end in half the training time, and matches models pre-trained on ImageNet while using only 0.08% of the data. Its activation maps highlight regions relevant to navigation decisions. The approach offers practical advantages for real-world robotic navigation by reducing data and compute requirements and by providing interpretable, navigation-focused features.

Abstract

Humans excel at efficiently navigating through crowds without collision by focusing on specific visual regions relevant to navigation. However, most robotic visual navigation methods rely on deep learning models pre-trained on vision tasks, which prioritize salient objects -- not necessarily relevant to navigation and potentially misleading. Alternative approaches train specialized navigation models from scratch, requiring significant computation. On the other hand, self-supervised learning has revolutionized computer vision and natural language processing, but its application to robotic navigation remains underexplored due to the difficulty of defining effective self-supervision signals. Motivated by these observations, in this work, we propose a Self-Supervised Vision-Action Model for Visual Navigation Pre-Training (VANP). Instead of detecting salient objects that are beneficial for tasks such as classification or detection, VANP learns to focus only on specific visual regions that are relevant to the navigation task. To achieve this, VANP uses a history of visual observations, future actions, and a goal image for self-supervision, and embeds them using two small Transformer Encoders. Then, VANP maximizes the information between the embeddings by using a mutual information maximization objective function. We demonstrate that most VANP-extracted features match human navigation intuition. VANP achieves performance comparable to models learned end-to-end with half the training time, and to models trained on a large-scale, fully supervised dataset, i.e., ImageNet, with only 0.08% of the data.
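
The mutual information objective described above is VICReg-based, so a generic VICReg loss between two batches of embeddings gives a rough idea of how it can work without negative samples. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names `vicreg_loss` and `vanp_style_loss`, the weights 25/25/1 (commonly used VICReg defaults), and the mixing weight `lam` used to combine the history-action and history-goal terms are all hypothetical choices made for illustration.

```python
import numpy as np


def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """Generic VICReg loss between two (batch, dim) embedding arrays.

    The weights are assumed defaults, not values taken from the VANP paper.
    """
    n, d = z_a.shape

    # Invariance: pull the two embeddings of the same sample together.
    invariance = np.mean((z_a - z_b) ** 2)

    def variance_term(z):
        # Hinge on the per-dimension standard deviation; this penalizes
        # collapse to a constant vector without using negative samples.
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    def covariance_term(z):
        # Penalize off-diagonal covariance so dimensions decorrelate.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    variance = 0.5 * (variance_term(z_a) + variance_term(z_b))
    covariance = covariance_term(z_a) + covariance_term(z_b)
    return sim_w * invariance + var_w * variance + cov_w * covariance


def vanp_style_loss(history, actions, goal, lam=0.5):
    """Hypothetical combination: align the history embedding with both the
    future-action embedding and the goal embedding (assumed weighting)."""
    return (lam * vicreg_loss(history, actions)
            + (1.0 - lam) * vicreg_loss(history, goal))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    history = rng.normal(size=(256, 8))
    actions = history + 0.1 * rng.normal(size=(256, 8))
    goal = history + 0.1 * rng.normal(size=(256, 8))
    print("loss:", vanp_style_loss(history, actions, goal))
```

In this sketch, a batch whose embeddings have all collapsed to the same vector has zero invariance and covariance cost but is still penalized by the variance hinge, which is what lets a VICReg-style objective avoid trivial solutions without contrastive negatives.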

Paper Structure

This paper contains 13 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Comparison of Activation Maps Learned by End-to-End, ImageNet, and VANP. Without downstream navigation supervision, VANP extracts multiple regions of interest for navigation, whereas the End-to-End and ImageNet pre-trained models focus on single salient regions.
  • Figure 2: VANP Architecture. VANP learns to embed temporal features into spatial features by using a sequence of images and leveraging two Transformer Encoders with context tokens. VANP's loss maximizes the mutual information between history, future actions, and the goal (left). Then, by appending an MLP to the Transformer context token, VANP predicts future trajectories during the downstream navigation task (right).
  • Figure 3: Qualitative Comparison. Comparison of the last layer activation maps among different methods on unseen scenarios.
  • Figure 4: Failure Cases. Samples without any important intra-frame changes cause the model to collapse.