UniLoc: Towards Universal Place Recognition Using Any Single Modality

Yan Xia; Zhendong Li; Yun-Jin Li; Letian Shi; Hu Cao; João F. Henriques; Daniel Cremers

TL;DR

UniLoc is a universal place recognition framework that accepts a query from any single modality (text, image, or point cloud). It splits the task into instance-level cross-modal alignment and scene-level aggregation, the latter performed by a Self-Attention based Pooling (SAP) module. Shared embeddings are learned with contrastive losses across modalities. On KITTI-360, UniLoc reports state-of-the-art cross-modal performance across six modality pairs while remaining competitive in uni-modal tasks. The approach builds on large-scale pre-trained encoders (e.g., CLIP) and a modality-centric design, supports natural-language-guided retrieval, and could be extended to additional modalities. Overall, UniLoc is a step toward sensor-agnostic localization, allowing flexible querying and greater robustness in real-world navigation scenarios.
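The contrastive objective mentioned above can be illustrated with a symmetric InfoNCE-style loss of the kind popularized by CLIP. This is a minimal sketch, not the authors' training code: the function names, the temperature value of 0.07, and the use of cosine similarity are all assumptions made for illustration. Row `i` of `emb_a` and row `i` of `emb_b` are assumed to describe the same place in two different modalities.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss over paired embeddings (hypothetical helper).

    emb_a[i] and emb_b[i] are treated as a positive pair; every other
    combination in the batch acts as a negative.
    """
    n = len(emb_a)
    # logits[i][j] = similarity of a_i to b_j, sharpened by the temperature.
    logits = [[cosine_similarity(a, b) / temperature for b in emb_b] for a in emb_a]
    logits_t = [list(col) for col in zip(*logits)]
    # Cross-entropy with the diagonal as the target, in both directions.
    loss_ab = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_ba = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_ab + loss_ba)


if __name__ == "__main__":
    image_emb = [[1.0, 0.0], [0.0, 1.0]]
    text_emb = [[1.0, 0.0], [0.0, 1.0]]
    print(symmetric_contrastive_loss(image_emb, text_emb))
```

The loss is close to zero when paired embeddings already coincide and grows large when the pairing is scrambled, which is the signal that pulls matching descriptors from different modalities together in the shared space.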

Abstract

To date, most place recognition methods focus on single-modality retrieval. While they perform well in specific environments, cross-modal methods offer greater flexibility by allowing seamless switching between map and query sources. A unified model also promises to reduce computation requirements and to achieve greater sample efficiency through shared parameters. In this work, we develop a universal solution to place recognition, UniLoc, that works with any single query modality (natural language, image, or point cloud). UniLoc leverages recent advances in large-scale contrastive learning and learns by matching hierarchically at two levels: instance-level matching and scene-level matching. Specifically, we propose a novel Self-Attention based Pooling (SAP) module that evaluates the importance of instance descriptors when they are aggregated into a place-level descriptor. Experiments on the KITTI-360 dataset demonstrate the benefits of cross-modality for place recognition, achieving superior performance in cross-modal settings and competitive results in uni-modal scenarios. Our project page is publicly available at https://yan-xia.github.io/projects/UniLoc/.
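To make the aggregation step concrete, the following is a minimal sketch of attention-based pooling of instance descriptors into one place-level descriptor. It is not the paper's SAP module: the abstract does not give its internals, so this sketch assumes a single attention head with no learned projections (queries, keys, and values are the descriptors themselves) and a hypothetical scoring vector `score_vector` that stands in for learned parameters.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(descriptors):
    """Scaled dot-product self-attention with Q = K = V = descriptors.

    Assumption: no learned projection matrices, to keep the sketch short.
    """
    dim = len(descriptors[0])
    scale = math.sqrt(dim)
    contextualized = []
    for query in descriptors:
        weights = softmax([dot(query, key) / scale for key in descriptors])
        contextualized.append([
            sum(w * value[d] for w, value in zip(weights, descriptors))
            for d in range(dim)
        ])
    return contextualized


def attention_pool(descriptors, score_vector):
    """Pool instance descriptors into a single place-level descriptor.

    Each contextualized descriptor gets an importance score from the
    hypothetical `score_vector`; the output is the softmax-weighted sum.
    Returns (pooled_descriptor, importance_weights).
    """
    contextualized = self_attention(descriptors)
    weights = softmax([dot(score_vector, c) for c in contextualized])
    dim = len(descriptors[0])
    pooled = [
        sum(w * c[d] for w, c in zip(weights, contextualized))
        for d in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    instances = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    pooled, weights = attention_pool(instances, score_vector=[1.0, 0.0])
    print(pooled, weights)
```

Because both the attention step and the weighted sum are symmetric in the instances, the pooled descriptor does not depend on the order in which instances are listed, which is a desirable property when a place contains an unordered set of objects.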
