Collaborative Dynamic 3D Scene Graphs for Open-Vocabulary Urban Scene Understanding
Tim Steinke, Martin Büchner, Niclas Vödisch, Abhinav Valada
TL;DR
This work addresses the challenge of building scalable, semantically rich maps for urban navigation in dynamic environments. It presents CURB-OSG, a centralized, open-vocabulary dynamic 3D scene graph engine that fuses observations from multiple agents with unknown initial poses to jointly optimize a pose graph and generate a hierarchical scene graph. Key contributions include a collaborative SLAM backend with inter-agent loop closures, open-vocabulary perception on each agent, and a multi-layer scene graph that integrates roads, static and dynamic objects, and a semantic 3D map. On real-world data from the Oxford Radar RobotCar dataset, the approach improves mapping and object proposal fusion over a single agent, and the authors release their code for reproducibility.
Abstract
Mapping and scene representation are fundamental to reliable planning and navigation in mobile robots. While purely geometric maps using voxel grids allow for general navigation, obtaining up-to-date spatial and semantically rich representations that scale to dynamic large-scale environments remains challenging. In this work, we present CURB-OSG, an open-vocabulary dynamic 3D scene graph engine that generates hierarchical decompositions of urban driving scenes via multi-agent collaboration. By fusing the camera and LiDAR observations from multiple perceiving agents with unknown initial poses, our approach generates more accurate maps than a single agent while constructing a unified open-vocabulary semantic hierarchy of the scene. Unlike previous methods, which rely on ground-truth agent poses or are evaluated purely in simulation, CURB-OSG operates without ground-truth poses and is evaluated on real-world data. We evaluate the capabilities of CURB-OSG on real-world multi-agent sensor data obtained from multiple sessions of the Oxford Radar RobotCar dataset. We demonstrate improved mapping and object prediction accuracy through multi-agent collaboration and evaluate the environment partitioning capabilities of the proposed approach. To foster further research, we release our code and supplementary material at https://ov-curb.cs.uni-freiburg.de.
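
To make the idea of a layered scene graph with fused multi-agent object proposals more concrete, the following is a minimal illustrative sketch and not the CURB-OSG implementation. Everything in it is an assumption made for illustration: the `Node` class, the layer names, the `fuse_proposal` and `attach_to_nearest_road` helpers, the 2D positions, and the 2 m merge radius. It assumes all proposals are already expressed in a common map frame, which in the actual system would only hold after the collaborative pose graph optimization.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical scene graph node; not the data model used by CURB-OSG."""
    node_id: str
    layer: str                      # assumed layer names: "road", "static_object"
    label: str                      # free-form open-vocabulary label
    position: tuple[float, float]   # 2D position in a shared map frame
    observations: int = 1
    children: list["Node"] = field(default_factory=list)


def fuse_proposal(nodes, label, position, layer="static_object", radius=2.0):
    """Merge a proposal into a nearby node with the same label, else add a node.

    The merge rule (same label, within `radius` metres, running mean of the
    position) is a simple stand-in chosen for this sketch.
    """
    for node in nodes:
        same_kind = node.label == label and node.layer == layer
        if same_kind and math.dist(node.position, position) <= radius:
            n = node.observations
            node.position = tuple(
                (n * p + q) / (n + 1) for p, q in zip(node.position, position)
            )
            node.observations = n + 1
            return node
    node = Node(f"{layer}_{len(nodes)}", layer, label, tuple(position))
    nodes.append(node)
    return node


def attach_to_nearest_road(roads, obj):
    """Link an object node to the closest road node to form the hierarchy."""
    road = min(roads, key=lambda r: math.dist(r.position, obj.position))
    road.children.append(obj)
    return road


# Two hypothetical road segments and proposals reported by two agents.
roads = [
    Node("road_0", "road", "road segment", (0.0, 0.0)),
    Node("road_1", "road", "road segment", (50.0, 0.0)),
]
objects = []
fuse_proposal(objects, "traffic light", (48.0, 3.0))  # seen by agent A
fuse_proposal(objects, "traffic light", (49.0, 3.0))  # seen by agent B, merged
fuse_proposal(objects, "parked car", (2.0, 1.0))
for obj in objects:
    attach_to_nearest_road(roads, obj)
```

In this toy example the two traffic light proposals collapse into a single node that is attached to the second road segment, while the parked car is attached to the first. A real system would need a more robust association step, for instance comparing open-vocabulary feature embeddings rather than exact label strings.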
