StereoNavNet: Learning to Navigate using Stereo Cameras with Auxiliary Occupancy Voxels

Hongyu Li, Taskin Padir, Huaizu Jiang

TL;DR

In extensive empirical evaluation, StereoNavNet (SNN) outperforms baseline approaches on success rate (SR), success weighted by path length (SPL), and navigation error (NE). It also generalizes better, maintaining leading performance in previously unseen environments.

Abstract

Visual navigation has received significant attention recently. Most prior work focuses on predicting navigation actions from semantic features extracted by visual encoders. However, these approaches often rely on large datasets and exhibit limited generalizability. In contrast, our approach draws inspiration from traditional navigation planners that operate on geometric representations, such as occupancy maps. We propose StereoNavNet (SNN), a novel visual navigation approach built on a modular learning framework with a perception module and a policy module. The perception module estimates an auxiliary 3D voxel occupancy grid from stereo RGB images and extracts geometric features from it. The policy module uses these features, along with user-defined goals, to predict navigation actions. Through extensive empirical evaluation, we demonstrate that SNN outperforms baseline approaches in terms of success rate, success weighted by path length, and navigation error. Furthermore, SNN exhibits better generalizability, maintaining leading performance when navigating previously unseen environments.
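The three metrics named in the abstract can be sketched as follows. This is a minimal illustration that assumes the commonly used definitions (SR as the fraction of successful episodes, SPL as success weighted by the ratio of shortest to actual path length, NE as the mean final distance to the goal); the paper's exact evaluation protocol is not reproduced here, and the `Episode` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """Hypothetical per-episode log; field names are illustrative only."""
    success: bool          # whether the episode counted as a success
    shortest_path: float   # shortest start-to-goal path length
    agent_path: float      # length of the path the agent actually travelled
    final_distance: float  # distance to the goal when the episode ended


def success_rate(episodes: Sequence[Episode]) -> float:
    # Fraction of episodes that ended in success.
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    # Success weighted by path length: failed episodes contribute 0,
    # successful ones contribute shortest / max(actual, shortest).
    total = 0.0
    for e in episodes:
        if not e.success:
            continue
        denom = max(e.agent_path, e.shortest_path)
        # Degenerate case: start already at the goal, count as optimal.
        total += 1.0 if denom == 0 else e.shortest_path / denom
    return total / len(episodes)


def navigation_error(episodes: Sequence[Episode]) -> float:
    # Mean distance to the goal at the end of each episode.
    return sum(e.final_distance for e in episodes) / len(episodes)


episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=5.0, final_distance=0.2),
    Episode(success=True, shortest_path=4.0, agent_path=8.0, final_distance=0.3),
    Episode(success=False, shortest_path=6.0, agent_path=9.0, final_distance=2.5),
]
print(success_rate(episodes), spl(episodes), navigation_error(episodes))
```

In this toy example the second episode succeeds but takes a path twice as long as necessary, so it contributes only 0.5 to SPL while still counting fully toward SR. Higher is better for SR and SPL; lower is better for NE.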

Paper Structure (14 sections, 4 equations, 5 figures, 4 tables)


Figures (5)

  • Figure 1: A high-level comparison. Unlike the conventional visual navigation approach (a), where visual input is typically encoded into semantic features using a visual encoder for subsequent action prediction, we introduce a novel visual navigation network (c) inspired by traditional navigation approaches (b). We extract an auxiliary voxel occupancy grid from semantic features using a neural network (NN) and derive geometric obstacle features from it, which our policy network is conditioned on.
  • Figure 2: The network design of StereoNavNet. We propose to extract the occupancy features from explicit geometry using the voxel occupancy grid and the curvature samples. The extracted features are used to predict an action using a four-layer MLP.
  • Figure 3: Training scenes. We deploy a privileged agent in five scenes from OmniGibson (Li et al., 2023) to collect an expert demonstration dataset.
  • Figure 4: Comparison against baseline approaches. We compare against agents using ResNet, Depth Anything, MobileStereoNet (MSNet), and the ground-truth voxel occupancy grid (GT Grid) on three metrics: SPL, SR, and NE. The GT Grid agent serves only as an upper bound for our policy module and is not directly comparable to the other agents. Error bars show standard errors.
  • Figure 5: Qualitative results. We visualize navigation experiments from four scenes. The top environment appears in the demonstration dataset; the rest are novel environments. Successful trials are labeled in green and failures in red. Each trajectory is drawn from a light color (start) to a dark color (end).
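The policy step described in the Figure 2 caption (occupancy features fed to a four-layer MLP that predicts an action) can be sketched roughly as below. This is not the authors' implementation: the grid size, the flattening used as "feature extraction", the relative-goal encoding, the layer widths, the random untrained weights, and the discrete action set are all assumptions made for illustration, and the curvature samples mentioned in the caption are omitted.

```python
import math
import random

# Hypothetical discrete action set; the paper's action space is not given here.
ACTIONS = ["forward", "turn_left", "turn_right", "stop"]


def make_layer(n_in, n_out, rng):
    # One fully connected layer with small random weights and zero biases.
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def flatten_grid(grid):
    # Assumed stand-in for feature extraction: flatten grid[x][y][z] to a vector.
    return [float(cell) for plane in grid for row in plane for cell in row]


def mlp_policy(features, layers):
    # ReLU after every layer except the last, which outputs action logits.
    h = features
    for layer in layers[:-1]:
        h = relu(linear(h, layer))
    return linear(h, layers[-1])


rng = random.Random(0)
# Toy 4 x 4 x 2 occupancy grid with a single occupied voxel.
grid = [[[0 for _ in range(2)] for _ in range(4)] for _ in range(4)]
grid[1][2][0] = 1
goal = [1.5, -0.5]  # assumed goal encoding: offset relative to the agent

features = flatten_grid(grid) + goal            # 32 occupancy values + 2 goal values
sizes = [len(features), 32, 32, 16, len(ACTIONS)]
layers = [make_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]  # four layers

logits = mlp_policy(features, layers)
action = ACTIONS[max(range(len(logits)), key=logits.__getitem__)]
print(action)
```

With untrained weights the chosen action is arbitrary; the point is only the data flow from an explicit occupancy grid and a goal to action logits. In the setup the captions describe, the policy would instead be trained on the expert demonstration dataset.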