Multiview Scene Graph

Juexiao Zhang, Gao Zhu, Sihang Li, Xinhao Liu, Haorui Song, Xinran Tang, Chen Feng

TL;DR

A novel baseline method built on mainstream pretrained vision models is developed, combining visual place recognition and object association in one Transformer decoder architecture, and an evaluation metric based on the intersection-over-union score of MSG edges is proposed.

Abstract

A proper scene representation is central to the pursuit of spatial intelligence where agents can robustly reconstruct and efficiently understand 3D scenes. A scene representation is either metric, such as landmark maps in 3D reconstruction, 3D bounding boxes in object detection, or voxel grids in occupancy prediction, or topological, such as pose graphs with loop closures in SLAM or visibility graphs in SfM. In this work, we propose to build Multiview Scene Graphs (MSG) from unposed images, representing a scene topologically with interconnected place and object nodes. The task of building MSG is challenging for existing representation learning methods since it requires jointly addressing visual place recognition, object detection, and object association from images with limited fields of view and potentially large viewpoint changes. To evaluate any method tackling this task, we developed an MSG dataset and annotation based on a public 3D dataset. We also propose an evaluation metric based on the intersection-over-union score of MSG edges. Moreover, we develop a novel baseline method built on mainstream pretrained vision models, combining visual place recognition and object association into one Transformer decoder architecture. Experiments demonstrate that our method has superior performance compared to existing relevant baselines.
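The edge-based metric mentioned above can be illustrated with a minimal sketch. It assumes an MSG is given as a list of undirected edges between node identifiers and that predicted nodes have already been matched to ground-truth nodes; the abstract does not specify these details, so the function names, the node labels, and the convention for two empty graphs are illustrative assumptions rather than the paper's exact definition.

```python
from typing import FrozenSet, Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def normalize_edges(edges: Iterable[Edge]) -> Set[FrozenSet[Hashable]]:
    """Treat edges as undirected and drop self-loops."""
    return {frozenset(e) for e in edges if e[0] != e[1]}


def edge_iou(pred: Iterable[Edge], gt: Iterable[Edge]) -> float:
    """Intersection-over-union between two undirected edge sets."""
    p = normalize_edges(pred)
    g = normalize_edges(gt)
    union = p | g
    if not union:
        # Assumed convention: two empty graphs agree perfectly.
        return 1.0
    return len(p & g) / len(union)


# Hypothetical toy scene: three images (place nodes) and one object node.
gt_edges = [("img0", "img1"), ("img1", "img2"), ("img0", "chair")]
pred_edges = [("img1", "img0"), ("img0", "chair"), ("img2", "chair")]

print(edge_iou(pred_edges, gt_edges))  # 2 shared edges out of 4 distinct -> 0.5
```

In this toy case the prediction recovers two of the three ground-truth edges and adds one spurious place-object edge, so the score penalizes both missed and hallucinated connections. The same function could be applied separately to place-place and place-object edges if the two are to be scored independently.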

Paper Structure

This paper contains 44 sections, 9 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Multiview Scene Graph (MSG). The task of MSG takes unposed RGB images as input and outputs a place+object graph. The graph contains place-place edges and place-object edges. Connected place nodes represent images taken at the same place. The same object recognized from different views is associated and merged as one node and connected to the corresponding place nodes.
  • Figure 2: The AoMSG model. Place and object queries are obtained by cropping the image feature map using the corresponding bounding boxes. The queries are then fed into the Transformer decoder to obtain the final place and object embeddings. Bounding boxes are shown in different colors for clarity. The parameters of the Transformer decoder and the linear projector heads are trained with supervised contrastive learning. The image encoder and object detector are pretrained and frozen.
  • Figure 3: Performance of different encoder backbones. We report results from the base models for both ConvNeXt [liu2022convnext] and ViT [dosovitskiy2020vit].
  • Figure 4: Visualization of the same objects and the same places. Objects are annotated with their predicted IDs.
  • Figure 5: Object embedding visualization using t-SNE [van2008tsne]. SepMSG-Direct, SepMSG-Linear, and AoMSG-2 are shown in each row respectively. Results from the same scene are aligned vertically. Colors indicate different objects. Each point is an appearance of an object. Best viewed in color.
  • ...and 6 more figures