
VXP: Voxel-Cross-Pixel Large-scale Image-LiDAR Place Recognition

Yun-Jin Li, Mariia Gladkova, Yan Xia, Rui Wang, Daniel Cremers

TL;DR

This work tackles cross-modal place recognition between cameras and LiDAR by introducing VXP, a three-stage pipeline that learns a shared latent space with strong local cross-modal constraints. A DINO ViT-based image encoder provides robust global and local features, while a voxel-based LiDAR branch uses a Voxel-Pixel Projection to align local voxel descriptors with image features, followed by global descriptor alignment. The method achieves state-of-the-art cross-modal retrieval on Oxford RobotCar, ViViD++, and KITTI, while remaining lightweight and suitable for real-time deployment. By leveraging local geometry alongside global context, VXP demonstrates strong cross-modal localization under varying conditions, with targeted ablations and qualitative analyses validating the design choices.

Abstract

Cross-modal place recognition methods are flexible GPS-alternatives under varying environment conditions and sensor setups. However, this task is non-trivial since extracting consistent and robust global descriptors from different modalities is challenging. To tackle this issue, we propose Voxel-Cross-Pixel (VXP), a novel camera-to-LiDAR place recognition framework that enforces local similarities in a self-supervised manner and effectively brings global context from images and LiDAR scans into a shared feature space. Specifically, VXP is trained in three stages: first, we deploy a visual transformer to compactly represent input images. Secondly, we establish local correspondences between image-based and point cloud-based feature spaces using our novel geometric alignment module. We then aggregate local similarities into an expressive shared latent space. Extensive experiments on the three benchmarks (Oxford RobotCar, ViViD++ and KITTI) demonstrate that our method surpasses the state-of-the-art cross-modal retrieval by a large margin. Our evaluations show that the proposed method is accurate, efficient and light-weight. Our project page is available at: https://yunjinli.github.io/projects-vxp/
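As a rough illustration of how retrieval in a shared descriptor space is typically scored, the sketch below ranks database descriptors by L2 distance to each query and computes Recall@K. It is a minimal toy, not the authors' evaluation code: the function names (`l2_distance`, `retrieve_top_k`, `recall_at_k`), the 2-D descriptors, and the ground-truth sets are all hypothetical, and the descriptors are assumed to have already been produced by the image and LiDAR networks.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_top_k(query, database, k):
    """Return indices of the k database descriptors closest to the query."""
    order = sorted(range(len(database)),
                   key=lambda i: l2_distance(query, database[i]))
    return order[:k]


def recall_at_k(queries, database, positives, k):
    """Fraction of queries with at least one true match in the top k.

    positives[i] is the set of database indices that count as the
    same place as queries[i] (hypothetical ground truth).
    """
    hits = 0
    for query, true_ids in zip(queries, positives):
        top_k = retrieve_top_k(query, database, k)
        if any(i in true_ids for i in top_k):
            hits += 1
    return hits / len(queries)


# Toy data: e.g. image-derived queries against a LiDAR-derived database.
database = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
queries = [[0.9, 0.1], [0.1, 0.2], [4.0, 4.0]]
positives = [{1}, {2}, {3}]

print(recall_at_k(queries, database, positives, 1))
print(recall_at_k(queries, database, positives, 2))
```

Because each modality is embedded independently, the database descriptors can be computed once offline, and only the query descriptor needs to be computed at run time.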

Paper Structure (21 sections, 7 equations, 15 figures, 9 tables)


Figures (15)

  • Figure 1: (Left) Voxel-Cross-Pixel (VXP) can effectively map data from different modalities (2D images and 3D LiDAR scans) into the shared latent space, which exhibits local similarities and captures global context. (Right) Recall for up to K = 25 retrieved places on the Oxford RobotCar benchmark. VXP consistently demonstrates superior cross-modal large-scale global retrieval performance.
  • Figure 2: The VXP pipeline comprises three steps: (1) image network training (\ref{subsec:image_pretrain}), (2) cross-modal local feature training (\ref{subsec:local_train}), and (3) cross-modal global descriptor training (\ref{subsec:global_train}). Starting from step (2), image features are frozen while the point cloud features are trained. The two networks operate independently during inference, so queries and database samples can be processed separately. The objective is to map different data into a shared latent space and minimize the distance (e.g. L2 norm) between global descriptors of different modalities taken from the same space.
  • Figure 3: Illustration of our proposed local feature optimization between projected voxel- and image-based feature maps. $\phi$ represents "empty" as the 3D feature maps are sparse. Note that the voxel local descriptor is the $\mathbf{v}_i^{\text{out}}$ introduced in \ref{eq:output_voxel_formulation}. After the projection, multiple $\mathbf{v}_i^{\text{out}}$ could be projected as per \ref{eq:local_desc_loss}.
  • Figure 4: DINO fine-tuning effects on attention maps. From left to right: an input image, an attention map generated by the pretrained DINO ViT-S/8 without fine-tuning, and a map produced after fine-tuning. After fine-tuning, important scene structures such as buildings and traffic poles receive higher attention.
  • Figure 5: From left to right: an input image, its attention map, and the projected feature map generated from the corresponding point cloud.
  • ...and 10 more figures
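The projection of sparse voxel descriptors onto an image-aligned feature map (Figure 3) can be pictured with a small sketch. Everything here is an assumption made for illustration rather than the paper's formulation: a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, voxel centers already expressed in the camera frame with `z` pointing forward, a feature-map cell size of 8 pixels, and mean pooling when several voxels fall into the same cell. The function name `project_voxels_to_feature_map` is hypothetical. Cells that receive no voxel are simply absent from the result, which plays the role of the "empty" entries.

```python
def project_voxels_to_feature_map(voxel_centers, voxel_features,
                                  fx, fy, cx, cy, image_size, cell_size):
    """Project voxel descriptors onto a coarse 2D grid (illustrative only).

    voxel_centers: list of (x, y, z) in the camera frame, z forward.
    voxel_features: list of feature vectors, one per voxel.
    Returns {(row, col): mean feature}; empty cells are not in the dict.
    """
    width, height = image_size
    sums, counts = {}, {}
    for (x, y, z), feat in zip(voxel_centers, voxel_features):
        if z <= 0:
            continue  # behind the camera
        # Pinhole projection to pixel coordinates.
        u = fx * x / z + cx
        v = fy * y / z + cy
        if not (0 <= u < width and 0 <= v < height):
            continue  # outside the image
        cell = (int(v // cell_size), int(u // cell_size))
        if cell not in sums:
            sums[cell] = [0.0] * len(feat)
            counts[cell] = 0
        sums[cell] = [s + f for s, f in zip(sums[cell], feat)]
        counts[cell] += 1
    # Assumed aggregation: average all descriptors landing in one cell.
    return {cell: [s / counts[cell] for s in total]
            for cell, total in sums.items()}


centers = [(0.0, 0.0, 10.0), (0.05, 0.0, 10.0), (0.0, 0.0, -1.0),
           (10.0, 0.0, 1.0), (-2.0, -2.0, 10.0)]
features = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0], [9.0, 9.0], [4.0, 4.0]]
feature_map = project_voxels_to_feature_map(
    centers, features, fx=100.0, fy=100.0, cx=32.0, cy=32.0,
    image_size=(64, 64), cell_size=8)
print(feature_map)
```

In this toy, the first two voxels land in the same cell and are averaged, while the voxel behind the camera and the one projecting outside the image are discarded.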