SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs
Yang Miao, Francis Engelmann, Olga Vysotska, Federico Tombari, Marc Pollefeys, Dániel Béla Baráth
TL;DR
SceneGraphLoc addresses cross-modal coarse visual localization: it localizes a query image within a reference map built from multi-modal 3D scene graphs. For each object node, it learns a fixed-size embedding that fuses geometry, multi-view images, and relational context into a shared embedding space; the corresponding query-image patches are embedded into the same space by a visual Transformer on top of DINOv2 features. Training uses a dual-branch contrastive objective with static and temporal terms that aligns patch and node embeddings across time, which keeps matching robust under scene changes. At inference, the method computes a graph-image similarity by aggregating patch-to-node affinities and retrieves the top-scoring candidates. It achieves strong performance with substantially less storage and faster retrieval than image-heavy baselines, and ablations confirm the contribution of each modality and of the backbone choice. Demonstrated on 3RScan and ScanNet, the approach offers a scalable, lightweight localization paradigm that operates on compact maps and approaches state-of-the-art image-based methods when map images are included. The authors plan to release the code publicly.
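The inference step described above (aggregate patch-to-node affinities into a graph-image score, then retrieve the top candidates) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the use of cosine similarity as the affinity, and the max-over-nodes followed by mean-over-patches aggregation are all assumptions made for clarity.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def scene_score(patch_emb, node_emb):
    """Graph-image similarity between one query image and one scene graph.

    patch_emb: (P, D) array of query patch embeddings.
    node_emb:  (N, D) array of object-node embeddings of the scene graph.
    Assumed aggregation: best-matching node per patch, averaged over patches.
    """
    affinity = l2_normalize(patch_emb) @ l2_normalize(node_emb).T  # (P, N)
    return float(affinity.max(axis=1).mean())


def retrieve_top_k(patch_emb, scene_graphs, k=1):
    """Rank scene graphs (dict: scene id -> (N, D) node embeddings) by score."""
    scores = {sid: scene_score(patch_emb, nodes) for sid, nodes in scene_graphs.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

Calling `retrieve_top_k(patch_emb, scene_graphs, k=5)` returns up to five `(scene_id, score)` pairs in descending score order. Because each scene is represented only by its node embeddings, the map stores one fixed-size vector per object rather than a database of images, which is the source of the compactness claimed above.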
Abstract
We introduce a novel problem: the localization of an input image within a multi-modal reference map represented by a database of 3D scene graphs. These graphs comprise multiple modalities, including object-level point clouds, images, attributes, and relationships between objects, offering a lightweight and efficient alternative to conventional methods that rely on extensive image databases. Given the available modalities, the proposed method, SceneGraphLoc, learns a fixed-size embedding for each node of the scene graph (i.e., for each object instance), enabling effective matching with the objects visible in the input query image. This strategy significantly outperforms other cross-modal methods, even without incorporating images into the map embeddings. When images are leveraged, SceneGraphLoc achieves performance close to that of state-of-the-art techniques that depend on large image databases, while requiring three orders of magnitude less storage and operating orders of magnitude faster. The code will be made public.
