SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs

Yang Miao, Francis Engelmann, Olga Vysotska, Federico Tombari, Marc Pollefeys, Dániel Béla Baráth

TL;DR

SceneGraphLoc tackles cross-modal coarse visual localization: given a query image, it identifies the matching scene in a reference map built from multi-modal 3D scene graphs. For each object node, it learns a fixed-size embedding that fuses geometry, multi-view images, and relational context into a shared embedding space; patches of the query image are embedded into the same space by a visual Transformer using DINOv2 features. Training uses a dual-branch contrastive objective with static and temporal terms that aligns patch and node embeddings across time, which keeps matching robust under scene changes. At inference, the method computes a graph-image similarity by aggregating patch-to-node affinities and retrieves the top-ranked candidate scenes. Compared with image-heavy baselines, this requires substantially less storage and retrieves faster, and ablations confirm the contribution of each modality and of the backbone choice. Evaluated on 3RScan and ScanNet, the approach operates on compact maps and approaches state-of-the-art image-based methods when map images are included. The authors plan to release the code publicly.
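
The retrieval step described above can be illustrated with a small sketch. The summary only states that patch-to-node affinities are aggregated into a graph-image similarity; the specific rule used here (take each patch's best cosine similarity over the nodes of a graph, then average over patches) is an assumption chosen for illustration, and the function names `cosine`, `graph_image_score`, and `retrieve` are hypothetical, not taken from the authors' code.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def graph_image_score(patch_embs, node_embs):
    """Aggregate patch-to-node affinities into one graph-image score.

    Assumed rule (not confirmed by the source): each patch votes with its
    best-matching node, and the votes are averaged over all patches.
    """
    if not patch_embs or not node_embs:
        return 0.0
    best = [max(cosine(p, n) for n in node_embs) for p in patch_embs]
    return sum(best) / len(best)

def retrieve(patch_embs, scene_graphs, k=1):
    """Return the k scene ids with the highest graph-image score, best first."""
    scored = [(graph_image_score(patch_embs, nodes), scene_id)
              for scene_id, nodes in scene_graphs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(scene_id, score) for score, scene_id in scored[:k]]

# Toy data: two query patches and two candidate scene graphs (made-up values).
query_patches = [[1.0, 0.0], [0.0, 1.0]]
candidate_graphs = {
    "scene_a": [[1.0, 0.1], [0.1, 1.0]],    # nodes roughly aligned with the patches
    "scene_b": [[-1.0, 0.0], [0.0, -1.0]],  # nodes pointing away from the patches
}
top = retrieve(query_patches, candidate_graphs, k=1)
```

With this toy data, `top` ranks `scene_a` first because both patches find a closely aligned node there. Other aggregation rules (for example, per-patch voting across all scenes) would fit the same interface.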

Abstract

We introduce a novel problem, i.e., the localization of an input image within a multi-modal reference map represented by a database of 3D scene graphs. These graphs comprise multiple modalities, including object-level point clouds, images, attributes, and relationships between objects, offering a lightweight and efficient alternative to conventional methods that rely on extensive image databases. Given the available modalities, the proposed method SceneGraphLoc learns a fixed-sized embedding for each node (i.e., representing an object instance) in the scene graph, enabling effective matching with the objects visible in the input query image. This strategy significantly outperforms other cross-modal methods, even without incorporating images into the map embeddings. When images are leveraged, SceneGraphLoc achieves performance close to that of state-of-the-art techniques depending on large image databases, while requiring three orders-of-magnitude less storage and operating orders-of-magnitude faster. The code will be made public.

Paper Structure (16 sections, 10 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Successful and failed cases for scene retrieval with $R^t@1$. On the left are the G.T. and predicted objects of the query image within the target scene graph. On the right are the top-3 retrieved scenes with their image-scene similarity scores.
  • Figure 2: Overview. The training phase is represented by orange arrows, while blue arrows denote the inference phase. During training, a query image and its associated 3D scene graph form a positive sample within a contrastive learning framework, where negative samples are generated by associating scene graphs of different scenes with the same query image. The objective is to learn the embeddings of both the graph and the image so that embeddings of the positive pair are drawn closer, whereas those of the negative pair are pushed apart. In the inference phase, the task is to assign the correct scene graph to a given query image from a selection of multiple graphs, which is done by choosing the graph whose embeddings have the highest cosine similarity to those of the image.
  • Figure 2: Shannon entropy $\mathcal{H}_I$, denoting the diversity of objects observed in the query image.
  • Figure 3: The embedding of image modality $\mathcal{I}$ for each object. The image crops of a pillow are shown as an example.
  • Figure 3: Comparison of G.T. and predicted instance segmentation on the ScanNet dataset (Dai et al., 2017). The left image shows that SceneGraphFusion (Wu et al., 2021), which is used in Section 4 of the main paper, can produce inaccurate instance segmentation (red box) and under-reconstructed regions (white boxes).
  • ...and 4 more figures
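
One of the Figure 2 captions above refers to the Shannon entropy $\mathcal{H}_I$ as a measure of the diversity of objects observed in a query image. A minimal sketch of such a measure follows, assuming the entropy is taken over the fraction of image patches assigned to each object instance and measured in bits (base-2 logarithm). Both choices are assumptions, since the listing does not give the exact definition, and the function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(patch_labels):
    """Shannon entropy (in bits) of the object labels assigned to image patches.

    Assumption: p_i is the fraction of patches that belong to object i, and
    H = sum_i p_i * log2(1 / p_i). An image showing a single object yields 0;
    an image split evenly over more objects yields a higher value.
    """
    counts = Counter(patch_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Toy label lists (made-up): one object only vs. two objects in equal share.
single_object = ["chair"] * 4
two_objects = ["chair", "chair", "table", "table"]
h_single = shannon_entropy(single_object)
h_two = shannon_entropy(two_objects)
```

Under these assumptions, a query image dominated by one object has low entropy, while an image covering many objects in similar proportions has high entropy.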