Dynamic Open-Vocabulary 3D Scene Graphs for Long-term Language-Guided Mobile Manipulation

Zhijie Yan, Shufei Li, Zuoxu Wang, Lixiu Wu, Han Wang, Jun Zhu, Lijiang Chen, Jihong Liu

TL;DR

DovSG tackles long-term mobile manipulation in dynamic indoor environments by constructing dynamic open-vocabulary 3D scene graphs from RGB-D streams and updating them locally as scenes change. A GPT-4o-powered language-guided planner decomposes tasks into manageable subtasks that are executed via navigation and manipulation modules, while ACE-based relocalization and ICP refinement maintain accurate localization during updates. The system integrates open-vocabulary 3D object mapping, memory-efficient graph updates, and a memory-aware task planner to enable robust long-term performance with dynamic scene changes. Real-world experiments show that DovSG achieves higher long-term task success, faster memory updates, and lower memory usage compared to static-scene baselines, demonstrating practical impact for adaptive mobile manipulation.

Abstract

Enabling mobile robots to perform long-term tasks in dynamic real-world environments is a formidable challenge, especially when the environment changes frequently due to human-robot interactions or the robot's own actions. Traditional methods typically assume static scenes, which limits their applicability in the continuously changing real world. To overcome these limitations, we present DovSG, a novel mobile manipulation framework that leverages dynamic open-vocabulary 3D scene graphs and a language-guided task planning module for long-term task execution. DovSG takes RGB-D sequences as input and utilizes vision-language models (VLMs) for object detection to obtain high-level object semantic features. Based on the segmented objects, a structured 3D scene graph is generated for low-level spatial relationships. Furthermore, an efficient mechanism for locally updating the scene graph allows the robot to adjust parts of the graph dynamically during interactions without the need for full scene reconstruction. This mechanism is particularly valuable in dynamic environments, enabling the robot to continually adapt to scene changes and effectively support the execution of long-term tasks. We validated our system in real-world environments with varying degrees of manual modifications, demonstrating its effectiveness and superior performance in long-term tasks. Our project page is available at: https://bjhyzj.github.io/dovsg-web.
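
As a rough illustration of the local-update idea described above (a sketch under stated assumptions, not the authors' implementation), the snippet below replaces only the part of a toy scene graph that falls inside a newly observed region and leaves everything else untouched. The `SceneGraph` class, the axis-aligned observed region, the single "on" relation, and the assumption that new observations already carry matched object ids are all hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # Hypothetical toy structure: object id -> (label, (x, y, z) centroid).
    objects: dict = field(default_factory=dict)
    # Child id -> parent id for a single "on" relation.
    on_edges: dict = field(default_factory=dict)


def in_region(point, lo, hi):
    """True if the point lies inside the axis-aligned box [lo, hi]."""
    return all(l <= c <= h for c, l, h in zip(point, lo, hi))


def local_update(graph, region_lo, region_hi, observed, observed_on):
    """Refresh only the objects and edges affected by a new observation.

    observed:    id -> (label, centroid) for objects seen in the region
                 (assumes ids were already associated with stored objects).
    observed_on: child id -> parent id relations inferred from the new view.
    Returns the set of object ids that were touched.
    """
    touched = set(observed)
    for oid, (_, centroid) in list(graph.objects.items()):
        if in_region(centroid, region_lo, region_hi):
            touched.add(oid)
            if oid not in observed:
                # Stored in the region but no longer seen: treat as removed.
                del graph.objects[oid]
    graph.objects.update(observed)
    # Keep edges of untouched children whose parent still exists.
    graph.on_edges = {
        child: parent
        for child, parent in graph.on_edges.items()
        if child not in touched and parent in graph.objects
    }
    graph.on_edges.update(observed_on)
    return touched


# Usage: the keys were moved from the cabinet to the table, and a book
# that used to be on the table is gone.
graph = SceneGraph(
    objects={
        "table": ("table", (0.0, 0.0, 0.7)),
        "cup": ("cup", (0.0, 0.0, 0.8)),
        "book": ("book", (0.3, 0.0, 0.8)),
        "cabinet": ("cabinet", (3.0, 0.0, 0.5)),
        "keys": ("keys", (3.0, 0.0, 1.0)),
    },
    on_edges={"cup": "table", "book": "table", "keys": "cabinet"},
)
touched = local_update(
    graph,
    region_lo=(-1.0, -1.0, 0.0),
    region_hi=(1.0, 1.0, 2.0),
    observed={
        "table": ("table", (0.0, 0.0, 0.7)),
        "cup": ("cup", (0.0, 0.0, 0.8)),
        "keys": ("keys", (0.2, 0.0, 0.8)),
    },
    observed_on={"cup": "table", "keys": "table"},
)
```

The point of the design is that the cost of an update scales with the size of the observed region rather than the whole scene: the cabinet entry is never visited for modification, while the keys and their edge are rewritten.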

Paper Structure

This paper contains 35 sections, 13 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of our DovSG system. DovSG is a mobile robotic system designed to perform long-term tasks in real-world environments. It can detect changes in the scene during task execution, ensuring that subsequent subtasks are completed correctly. The system consists of five main components: perception, memory, task planning, navigation, and manipulation. The memory module includes a lower-level semantic memory and a higher-level scene graph, both of which are continuously updated as the robot explores the environment. This enables the robot to promptly detect manual changes (e.g., keys being moved from cabinet to table) and make the necessary adjustments for subsequent tasks (such as correctly executing Task 2-2).
  • Figure 2: Initialization and Construction of 3D Scene Graphs. We first use the RGB-D-based DROID-SLAM model (Teed and Deng, 2021) to predict the pose of each frame in the scene. Then, we apply an advanced open-vocabulary segmentation model to segment regions in the RGB images, extract semantic feature vectors for each region, and project them onto a 3D point cloud. Based on semantic, geometric, and CLIP feature similarities, the same object captured from multiple views is gradually associated and fused, resulting in a series of 3D objects. Next, we infer the relationships between objects based on their spatial positions and generate edges connecting these objects, forming a scene graph. This scene graph provides a structured and comprehensive understanding of the scene: it allows efficient localization of target objects, enables easy reconstruction and updating in dynamic environments, and supports task planning with large language models.
  • Figure 3: Adaptation in interactions with manually modified scenes. (1) We train the scene-specific regression MLP of the ACE model using RGB images and their poses, making the process highly efficient. (2) After manual scene modification, multi-view observations allow rough global pose estimation via ACE, refined further using LightGlue and ICP. The new viewpoint’s point cloud closely aligns with the stored pose. (3) The bottom image shows accurate local updates to the scene based on observations from the new viewpoint.
  • Figure 4: Two proposed grasp strategies in DovSG. In the first row, we crop the point cloud fed into AnyGrasp to a certain range around the target object, allowing AnyGrasp to focus on the target object without compromising the generation of collision-free grasps. We then filter the grasps based on translational and rotational costs, with the red grasps indicating the highest confidence. In the second row, we show our heuristic grasp strategy, which leverages the object's bounding box information to rotate and select the most appropriate grasp orientation.
  • Figure 5: Degrees of environmental modifications. The left column shows the initial state of the scene, while the two columns on the right represent the state of the scene after manual modifications.
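
The multi-view association step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under assumptions, not the paper's method: the weighted-sum score, the weights, the merge threshold, the neighbor radius, and the helper names (`cosine`, `overlap`, `associate`) are all hypothetical, and a label match stands in for semantic similarity while a plain feature vector stands in for a CLIP embedding.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def overlap(points_a, points_b, radius=0.05):
    """Fraction of points in A with at least one point of B within `radius`."""
    dists = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=-1)
    return float((dists.min(axis=1) < radius).mean())


def associate(objects, det, weights=(0.2, 0.4, 0.4), threshold=0.6):
    """Merge a detection into the best-matching object, or create a new one.

    Each object/detection is a dict with "label", "points" (N x 3), "feat".
    The score mixes semantic (label), geometric, and feature similarity.
    Returns the index of the object the detection ended up in.
    """
    w_sem, w_geo, w_feat = weights
    scores = [
        w_sem * float(det["label"] == obj["label"])
        + w_geo * overlap(det["points"], obj["points"])
        + w_feat * cosine(det["feat"], obj["feat"])
        for obj in objects
    ]
    if scores and max(scores) >= threshold:
        idx = int(np.argmax(scores))
        obj = objects[idx]
        n = obj["views"]
        # Fuse geometry and keep a running mean of the feature vector.
        obj["points"] = np.vstack([obj["points"], det["points"]])
        obj["feat"] = (n * obj["feat"] + det["feat"]) / (n + 1)
        obj["views"] = n + 1
        return idx
    objects.append({**det, "views": 1})
    return len(objects) - 1


# Usage: a cup seen twice from slightly different views, and a distant book.
rng = np.random.default_rng(0)
cup_pts = rng.normal([0.0, 0.0, 0.8], 0.01, size=(50, 3))
book_pts = rng.normal([2.0, 0.0, 0.8], 0.01, size=(50, 3))
objects = []
ids = [
    associate(objects, {"label": "cup", "points": cup_pts,
                        "feat": np.array([1.0, 0.0])}),
    associate(objects, {"label": "book", "points": book_pts,
                        "feat": np.array([0.0, 1.0])}),
    associate(objects, {"label": "cup", "points": cup_pts + 0.005,
                        "feat": np.array([0.9, 0.1])}),
]
```

In this toy run the second view of the cup is fused into the first object, while the book becomes a separate node. A real pipeline would likely replace the dense pairwise distance matrix with a spatial index for larger point clouds.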