Attention-Enhanced Cross-modal Localization Between 360 Images and Point Clouds

Zhipeng Zhao, Huai Yu, Chenwei Lyv, Wen Yang, Sebastian Scherer

TL;DR

This work proposes an end-to-end learnable network that performs cross-modal visual localization by establishing similarity between 360 images and point clouds in a high-dimensional feature space. Inspired by the attention mechanism, the network is optimized to capture the salient features used to compare the two modalities.

Abstract

Visual localization plays an important role in intelligent robots and autonomous driving, especially when GNSS is unreliable. Recently, camera localization in LiDAR maps has attracted growing attention for its low cost and potential robustness to illumination and weather changes. However, the commonly used pinhole camera has a narrow field of view, so it captures limited information compared with omni-directional LiDAR data. To overcome this limitation, we focus on correlating the information in 360 equirectangular images with point clouds, and propose an end-to-end learnable network that performs cross-modal visual localization by establishing similarity in a high-dimensional feature space. Inspired by the attention mechanism, we optimize the network to capture the salient features for comparing images and point clouds. We construct several sequences containing 360 equirectangular images and corresponding point clouds based on the KITTI-360 dataset and conduct extensive experiments. The results demonstrate the effectiveness of our approach.
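
The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that the network has already mapped the query 360 image and each point cloud sub-map to a fixed-length descriptor, and that similarity is measured with cosine similarity; the abstract does not state which similarity measure the authors use, so this choice, the function names, and the toy descriptor values are all illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_descriptor, submap_descriptors, k=1):
    """Return the indices of the k sub-maps most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_descriptor, d), i)
        for i, d in enumerate(submap_descriptors)
    ]
    # Sort by descending similarity; break ties by the lower sub-map index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


# Toy descriptors standing in for network outputs (hypothetical values).
submaps = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [0.9, 0.1, 0.0]
top2 = retrieve(query, submaps, k=2)
print(top2)
```

The location estimate is then the position associated with the best-ranked sub-map. A real system would typically replace the linear scan with an approximate nearest-neighbor index once the map contains many sub-maps.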

Paper Structure (26 sections, 10 equations, 7 figures, 4 tables)

This paper contains 26 sections, 10 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Comparison of spherical images and perspective images with point cloud counterparts. The right side shows the point clouds corresponding to the images on the left, which were obtained at the same location.
  • Figure 2: A schematic of the cross-modal localization. The localization is performed by comparing the query 360 image with the point clouds sub-maps from the global map and then finding the closest sub-map to determine the location.
  • Figure 3: The Architecture of our Model for Cross-modal Localization. The inputs are the 360 image and the point cloud sub-map.
  • Figure 4: The results of cross-modal localization. The second and third rows show the top-1 point cloud sub-map retrieved for the query 360 image by the ResNet-based Baseline and the AE-Spherical Model, where a green frame indicates a correct result and a red frame indicates an incorrect result.
  • Figure 5: Recall@k measure of AE-Spherical Model and ResNet-based Baseline for four tasks including same-modal localization (a), (b) and cross-modal localization (c), (d).
  • ...and 2 more figures
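
The Recall@k measure named in the Figure 5 caption can be sketched as follows. This is a minimal illustration that assumes the common retrieval definition: a query counts as a hit if at least one true match appears among its top-k retrieved sub-maps. The criterion the authors use to decide which sub-maps are true matches is not given in this summary, so the `positives` sets and all values below are hypothetical.

```python
def recall_at_k(ranked_lists, positives, k):
    """Fraction of queries with at least one true match among the top-k retrievals.

    ranked_lists: for each query, sub-map indices ordered from best to worst.
    positives: for each query, the set of sub-map indices treated as correct.
    """
    hits = 0
    for ranked, truth in zip(ranked_lists, positives):
        if any(idx in truth for idx in ranked[:k]):
            hits += 1
    return hits / len(ranked_lists)


# Four toy queries with three retrieved sub-maps each (hypothetical values).
ranked_lists = [[3, 1, 0], [2, 0, 1], [1, 3, 2], [0, 2, 3]]
positives = [{3}, {0}, {2}, {1}]

r1 = recall_at_k(ranked_lists, positives, 1)
r2 = recall_at_k(ranked_lists, positives, 2)
r3 = recall_at_k(ranked_lists, positives, 3)
print(r1, r2, r3)
```

Recall@k is non-decreasing in k, which is why such curves are usually plotted over a range of k values.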