UniLoc: Towards Universal Place Recognition Using Any Single Modality

Yan Xia, Zhendong Li, Yun-Jin Li, Letian Shi, Hu Cao, João F. Henriques, Daniel Cremers

TL;DR

UniLoc introduces a universal place recognition framework that operates from any single modality (text, image, or point cloud) by separating the task into instance-level cross-modal alignment and scene-level aggregation via a Self-Attention Pooling module. It learns shared embeddings through contrastive losses across modalities and demonstrates state-of-the-art cross-modal performance on KITTI-360 across six modality pairs, while remaining competitive in uni-modal tasks. The approach leverages large-scale pre-trained encoders (e.g., CLIP) and a modality-centric design to enable robust cross-modal localization and even natural-language-guided retrieval, with potential extensions to additional modalities. Overall, UniLoc advances sensor-agnostic localization, enabling flexible querying and improved robustness in real-world navigation scenarios.

Abstract

To date, most place recognition methods focus on single-modality retrieval. While they perform well in specific environments, cross-modal methods offer greater flexibility by allowing seamless switching between map and query sources. They also promise to reduce computation requirements through a unified model and to achieve greater sample efficiency by sharing parameters. In this work, we develop a universal solution to place recognition, UniLoc, that works with any single query modality (natural language, image, or point cloud). UniLoc leverages recent advances in large-scale contrastive learning and learns by matching hierarchically at two levels: instance-level matching and scene-level matching. Specifically, we propose a novel Self-Attention based Pooling (SAP) module to evaluate the importance of instance descriptors when aggregated into a place-level descriptor. Experiments on the KITTI-360 dataset demonstrate the benefits of cross-modality for place recognition, achieving superior performance in cross-modal settings and competitive results in uni-modal scenarios. Our project page is publicly available at https://yan-xia.github.io/projects/UniLoc/.
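
The two ideas named in the abstract, attention-weighted aggregation of instance descriptors and contrastive matching between modalities, can be illustrated with a minimal NumPy sketch. This is not the paper's SAP module or training loss, whose details are not given in this section. It is a generic single-head self-attention pooling and a CLIP-style symmetric contrastive loss, and every name (`attention_pool`, `symmetric_contrastive_loss`, `w_q`, `w_k`, `w_v`, `temperature`) is a hypothetical choice made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def attention_pool(instances, w_q, w_k, w_v):
    """Pool (n, d) instance descriptors into one unit-norm place descriptor.

    Assumption: single-head self-attention, with each instance weighted by
    the average attention it receives. The paper's SAP may differ.
    """
    q, k, v = instances @ w_q, instances @ w_k, instances @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)  # (n, n), rows sum to 1
    attended = attn @ v                                      # (n, d)
    importance = attn.mean(axis=0)                           # (n,), sums to 1
    return l2_normalize(importance @ attended)               # (d,)


def symmetric_contrastive_loss(a, b, temperature=0.07):
    """CLIP-style loss: row i of `a` should match row i of `b`, and vice versa."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature                           # (batch, batch)
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b retrieval
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a retrieval
    return 0.5 * (loss_ab + loss_ba)
```

In this sketch, a place descriptor for each modality would come from `attention_pool`, and batches of such descriptors from two modalities would be passed to `symmetric_contrastive_loss`, so that descriptors of the same place are pulled together and descriptors of different places are pushed apart.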

Paper Structure

This paper contains 32 sections, 8 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: (Left) We present UniLoc, a solution designed for city-scale place recognition using any one of three modalities: text, image, or point cloud. (Right) Localization performance at top-1 recall on the KITTI-360 (Liao et al., 2022) test set. The proposed UniLoc achieves state-of-the-art performance across all six cross-modal place recognition settings: image-to-point cloud (I2P), point cloud-to-image (P2I), text-to-point cloud (T2P), point cloud-to-text (P2T), image-to-text (I2T), and text-to-image (T2I). Notably, UniLoc surpasses existing SOTA cross-modal methods by a large margin while achieving competitive performance in uni-modal place recognition.
  • Figure 2: Overview of the proposed pipeline, consisting of instance-level and scene-level matching stages.
  • Figure 3: (Top) The architecture of instance-level matching. It consists of three instance-level feature extraction blocks: Text Instance Block (TXIB), Image Instance Block (IMIB), and Point Cloud Instance Block (PCIB). We train an Image-Text and an Image-Point cloud model to align image-text instances and image-point cloud instances, respectively. (Bottom) The architecture of the image and point cloud instance encoders. Note that the pre-trained CLIP image and text encoders are frozen during training.
  • Figure 4: (Top) The proposed fine matching architecture of UniLoc. It consists of three parallel feature extraction branches: point cloud, image, and text. (Bottom) The architecture of the proposed Self-Attention based Pooling (SAP) module.
  • Figure 5: Performance comparison for Text-Image place recognition on the KITTI-360 dataset. "X-VLM/GeoText-finetune" indicates that we finetune the model on the KITTI-360 dataset.
  • ...and 5 more figures