BEVNav: Robot Autonomous Navigation Via Spatial-Temporal Contrastive Learning in Bird's-Eye View

Jiahao Jiang, Yuxiang Yang, Yingqi Deng, Chenlong Ma, Jing Zhang

TL;DR

This paper proposes a self-supervised spatial-temporal contrastive learning approach to learn BEV representations and incorporates this spatial-temporal contrastive learning in the Soft Actor-Critic reinforcement learning framework to offer a superior navigation policy.

Abstract

Goal-driven mobile robot navigation in map-less environments requires effective state representations for reliable decision-making. Inspired by the favorable properties of Bird's-Eye View (BEV) in point clouds for visual perception, this paper introduces a novel navigation approach named BEVNav. It employs deep reinforcement learning to learn BEV representations and enhance decision-making reliability. First, we propose a self-supervised spatial-temporal contrastive learning approach to learn BEV representations. Spatially, two randomly augmented views from a point cloud predict each other, enhancing spatial features. Temporally, we combine the current observation with consecutive frames' actions to predict future features, establishing the relationship between observation transitions and actions to capture temporal cues. Then, incorporating this spatial-temporal contrastive learning in the Soft Actor-Critic reinforcement learning framework, our BEVNav offers a superior navigation policy. Extensive experiments demonstrate BEVNav's robustness in environments with dense pedestrians, outperforming state-of-the-art methods across multiple benchmarks. The code will be made publicly available at https://github.com/LanrenzzzZ/BEVNav.
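
The spatial objective described in the abstract, where two augmented views of the same point cloud predict each other, can be illustrated with a symmetric InfoNCE-style loss. This is only a minimal sketch: the abstract does not state the exact loss, so the InfoNCE form, the temperature value, and the function names `info_nce` and `symmetric_spatial_loss` are assumptions made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    queries[i] and keys[i] are treated as the positive pair; every other
    key in the batch acts as a negative. The temperature is an assumed value.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i with every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index as the target class.
        total += log_sum_exp - logits[i]
    return total / len(q)


def symmetric_spatial_loss(view_a, view_b, temperature=0.1):
    """Each view predicts the other, so the loss is averaged over both directions."""
    return 0.5 * (info_nce(view_a, view_b, temperature)
                  + info_nce(view_b, view_a, temperature))
```

With matching features in both views the loss is close to zero, and it grows when the rows of one view are shuffled so that positives no longer line up. Under the same assumptions, a temporal variant would use features predicted from the current observation and actions as `queries` and the encoded future frame as `keys`.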

Paper Structure (19 sections, 7 equations, 4 figures, 3 tables)

This paper contains 19 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: In the BEVNav framework, we propose a Sparse-Dense BEV Network to convert 3D point clouds into BEV features efficiently. This conversion not only creates an effective scene representation but also facilitates learning effective state representations via spatial-temporal contrastive learning. This in turn allows these representations to be aligned with the action space, offering a more efficient and accurate navigation policy.
  • Figure 2: Architecture of BEVNav for BEV feature extraction and spatial/temporal contrastive learning. We devise a new Sparse-Dense BEV network to efficiently extract BEV features from 3D point clouds, and use global max-pooling to obtain the latent features. Spatial contrastive learning enhances the representation of spatial information by having two data-augmented views predict each other's features. Temporal contrastive learning combines the current observation with consecutive frames' actions to predict future features, helping to establish the relationship between observation transitions and actions.
  • Figure 3: SAC-based navigation policy learning framework. It comprises two key components: 1) BEV feature extraction, and 2) BEV-based action decision-making and action evaluation.
  • Figure 4: Gazebo simulation environments: the Square-World and the Lobby-World. The Square-World offers a vast open space, used for training and testing. In contrast, the Lobby-World represents a more complex and dynamic environment, only used for testing and optimizing robotic navigation policy in crowded settings.
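
To make the point-cloud-to-BEV step and the global max-pooling mentioned in Figures 1 and 2 concrete, here is a minimal sketch. It is not the Sparse-Dense BEV Network, whose details are not given here: it assumes a plain height-map rasterization, and the grid extent, cell size, and function names `points_to_bev` and `global_max_pool` are hypothetical choices for illustration.

```python
def points_to_bev(points, x_range=(-4.0, 4.0), y_range=(-4.0, 4.0), cell=1.0):
    """Rasterize (x, y, z) points into a top-down grid of maximum heights.

    Assumptions for this sketch: empty cells hold 0.0, heights are
    non-negative, and points outside the grid extent are dropped.
    """
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    grid = [[0.0] * ny for _ in range(nx)]
    for x, y, z in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue  # outside the BEV extent
        i = int((x - x_range[0]) / cell)
        j = int((y - y_range[0]) / cell)
        grid[i][j] = max(grid[i][j], z)  # keep the tallest point per cell
    return grid


def global_max_pool(feature_maps):
    """Collapse each 2D channel to its maximum value, giving one number per channel."""
    return [max(max(row) for row in fmap) for fmap in feature_maps]


# Usage: a single-channel BEV map pooled into a 1-element latent vector.
cloud = [(0.5, 0.5, 1.2), (0.7, 0.2, 0.4), (-3.5, -3.5, 2.0), (10.0, 0.0, 5.0)]
bev = points_to_bev(cloud)
latent = global_max_pool([bev])
```

In a learned pipeline the pooled channels would come from network feature maps rather than raw heights; the pooling step itself stays the same, producing a fixed-size latent vector regardless of grid resolution.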