Collaborative Dynamic 3D Scene Graphs for Open-Vocabulary Urban Scene Understanding

Tim Steinke, Martin Büchner, Niclas Vödisch, Abhinav Valada

TL;DR

This work addresses the challenge of building scalable, semantically rich maps for urban navigation in dynamic environments. It presents CURB-OSG, a centralized, open-vocabulary dynamic 3D scene graph engine that fuses observations from multiple agents with unknown initial poses to jointly optimize a pose graph and generate a hierarchical scene graph. Key contributions include a collaborative SLAM backend with inter-agent loop closures, open-vocabulary perception on each agent, and a multi-layer scene graph that integrates roads, static and dynamic objects, and a semantic 3D map. The approach demonstrates improved mapping and object proposal fusion on real-world data from the Oxford Radar RobotCar dataset and releases its code for reproducibility, highlighting its potential for robust, scalable urban scene understanding in multi-agent settings.

Abstract

Mapping and scene representation are fundamental to reliable planning and navigation in mobile robots. While purely geometric maps using voxel grids allow for general navigation, obtaining up-to-date spatial and semantically rich representations that scale to dynamic large-scale environments remains challenging. In this work, we present CURB-OSG, an open-vocabulary dynamic 3D scene graph engine that generates hierarchical decompositions of urban driving scenes via multi-agent collaboration. By fusing the camera and LiDAR observations from multiple perceiving agents with unknown initial poses, our approach generates more accurate maps compared to a single agent while constructing a unified open-vocabulary semantic hierarchy of the scene. Unlike previous methods that rely on ground truth agent poses or are evaluated purely in simulation, CURB-OSG alleviates these constraints. We evaluate the capabilities of CURB-OSG on real-world multi-agent sensor data obtained from multiple sessions of the Oxford Radar RobotCar dataset. We demonstrate improved mapping and object prediction accuracy through multi-agent collaboration as well as evaluate the environment partitioning capabilities of the proposed approach. To foster further research, we release our code and supplementary material at https://ov-curb.cs.uni-freiburg.de.
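
The abstract notes that agents start with unknown initial poses and are brought into one map through collaboration. As a minimal sketch of the underlying geometry (not the authors' implementation, which jointly optimizes a full pose graph), the example below shows how a single inter-agent loop closure fixes the transform between two agents' odometry frames. It works in 2D for brevity, and the helper names `se2` and `align_agent_frames` are hypothetical.

```python
import numpy as np


def se2(x, y, theta):
    """Homogeneous 3x3 transform for a planar pose (x, y, heading)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]])


def align_agent_frames(T_a_i, T_b_j, T_i_j):
    """Return T_a_b, the transform taking agent B's frame into agent A's frame.

    T_a_i: pose of A's keyframe i, expressed in A's odometry frame
    T_b_j: pose of B's keyframe j, expressed in B's odometry frame
    T_i_j: loop-closure measurement, pose of keyframe j relative to keyframe i

    Keyframe j seen from A's frame is T_a_i @ T_i_j, and it is also
    T_a_b @ T_b_j, so T_a_b = T_a_i @ T_i_j @ inv(T_b_j).
    """
    return T_a_i @ T_i_j @ np.linalg.inv(T_b_j)


# Synthetic check: pick a "true" offset between the frames, simulate a
# noise-free loop closure, and recover the offset from it.
T_a_b_true = se2(5.0, -3.0, np.pi / 2)
T_a_i = se2(4.0, 0.0, 1.0)
T_b_j = se2(2.0, 1.0, 0.3)
T_i_j = np.linalg.inv(T_a_i) @ T_a_b_true @ T_b_j
T_a_b = align_agent_frames(T_a_i, T_b_j, T_i_j)
```

With noisy measurements a single closure only gives an initial guess, which is why a graph-based backend that refines all poses jointly is the natural next step.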

Paper Structure

This paper contains 12 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: We present CURB-OSG for generating open-vocabulary collaborative dynamic 3D scene graphs to model urban driving scenes. We fuse the observations of multiple agents by performing global inter-agent graph optimization using a centralized mapping instance without assuming initial agent poses. A tight coupling with a 3D scene graph engine allows for merging object proposals using open-vocabulary semantics even under mapping ambiguities.
  • Figure 2: An overview of our CURB-OSG approach operating on LiDAR and camera data from multiple agents. On each agent, we perform open-vocabulary perception that processes monocular image data from surround-view cameras to extract 2D object detections using Grounding DINO [liuGroundingDINOMarrying2024] and dynamic object tracks through MASA [liMatchingAnythingSegmenting2024]. Simultaneously, we estimate each agent's LiDAR odometry via scan matching and construct keyframes that are sent to a central server. All object observations, both static and dynamic, are projected onto the filtered LiDAR point clouds and extracted to obtain 3D object observations relative to the keyframe poses. The central server receives the keyframes and runs graph-based SLAM [koide2019hdlgraphslam] coupled with LiDAR-based loop closure detection [kim2018scancontext] to estimate a joint pose graph holding the historic poses of all agents. Finally, all object observations and the semantic point clouds are processed in our 3D scene graph construction module to obtain a unified, hierarchical representation of the environment.
  • Figure 3: In our open-vocabulary (OV) perception module, (1) we employ Grounding DINO [liuGroundingDINOMarrying2024] to detect relevant semantic categories, (2) utilize these detections as prompts for TAP [pan2024tokenizeAnything] to generate semantic masks, and (3) filter the output to include only objects represented in the scene graph.
  • Figure 4: Our dynamic object perception module uses MASA [liMatchingAnythingSegmenting2024] with the Grounding DINO [liuGroundingDINOMarrying2024] detector to track objects across sequential images. We project the detections onto the point cloud to obtain dynamic 3D object observations. These are then transmitted to the server, contributing to the dynamic objects layer.
  • Figure 5: Evolution of the mean absolute trajectory error (ATE) over time without dynamic object removal. We provide the mean and the standard deviation across the agents. The temporal average of the ATE is reported in the mapping evaluation table (tab:mapping-eval). In the case of multi-agent mapping, we plot the ATE as soon as an initial loop closure is incorporated, thus appearing delayed.
  • ...and 2 more figures
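
Figure 5 reports the ATE over time together with its mean and standard deviation across agents. As a rough illustration of how such curves can be computed (not the authors' evaluation code), the sketch below assumes that each estimated trajectory is already time-synchronized with its ground truth and expressed in the same frame, so no alignment step is included. The function names are hypothetical.

```python
import numpy as np


def absolute_trajectory_error(estimated, ground_truth):
    """Per-pose translational error between two matched trajectories.

    Both inputs are (N, 3) arrays of positions in the same frame (assumption).
    Returns an (N,) array of Euclidean distances.
    """
    estimated = np.asarray(estimated, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    if estimated.shape != ground_truth.shape:
        raise ValueError("trajectories must have the same shape")
    return np.linalg.norm(estimated - ground_truth, axis=1)


def ate_across_agents(errors_per_agent):
    """Mean and standard deviation across agents at every time step.

    errors_per_agent is an (A, N) array: one row of per-pose errors per agent.
    """
    errors = np.asarray(errors_per_agent, dtype=float)
    return errors.mean(axis=0), errors.std(axis=0)


# Toy data: two agents, four poses each, with constant position offsets.
ground_truth = np.zeros((4, 3))
agent_1 = ground_truth + np.array([3.0, 4.0, 0.0])  # 5 m off at every pose
agent_2 = ground_truth + np.array([0.0, 0.0, 1.0])  # 1 m off at every pose

errors = [
    absolute_trajectory_error(agent_1, ground_truth),
    absolute_trajectory_error(agent_2, ground_truth),
]
mean_ate, std_ate = ate_across_agents(errors)  # both have shape (4,)
temporal_average = mean_ate.mean()  # single number summarizing the run
```

The temporal average in the last line corresponds to the kind of single summary value the caption says is reported in the mapping evaluation table.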