Language-Grounded Dynamic Scene Graphs for Interactive Object Search with Mobile Manipulation

Daniel Honerkamp; Martin Büchner; Fabien Despinoy; Tim Welschehold; Abhinav Valada

TL;DR

This work addresses autonomous long-horizon reasoning for mobile manipulation in large, unexplored environments by grounding large language models (LLMs) in dynamically updated, open-vocabulary scene graphs. The MoMa-LLM framework combines a hierarchical 3D scene graph with a navigational Voronoi graph, dynamic RGB-D mapping, and structured language prompts to produce high-level actions that low-level policies execute, enabling zero-shot reasoning across navigation and manipulation tasks. Key contributions include a scalable dynamic scene representation, compact knowledge extraction for LLM grounding, a semantic interactive search task with a novel full-efficiency evaluation curve and AUC-E metric, and successful transfer to a real-world apartment. The approach improves search efficiency over baselines and shows promise for generalizing to broader household tasks.

Abstract

To fully leverage the capabilities of mobile manipulation robots, it is imperative that they are able to autonomously execute long-horizon tasks in large unexplored environments. While large language models (LLMs) have shown emergent reasoning skills on arbitrary tasks, existing work primarily concentrates on explored environments, typically focusing on either navigation or manipulation tasks in isolation. In this work, we propose MoMa-LLM, a novel approach that grounds language models within structured representations derived from open-vocabulary scene graphs, dynamically updated as the environment is explored. We tightly interleave these representations with an object-centric action space. Given object detections, the resulting approach is zero-shot, open-vocabulary, and readily extendable to a spectrum of mobile manipulation and household robotic tasks. We demonstrate the effectiveness of MoMa-LLM in a novel semantic interactive search task in large realistic indoor environments. In extensive experiments in both simulation and the real world, we show substantially improved search efficiency compared to conventional baselines and state-of-the-art approaches, as well as its applicability to more abstract tasks. We make the code publicly available at http://moma-llm.cs.uni-freiburg.de.

Paper Structure (39 sections, 10 equations, 10 figures, 8 tables)

Introduction
Related Work
Problem Statement: Embodied Reasoning
MoMa-LLM
Hierarchical 3D Scene Graph
Dynamic RGB-D Mapping
Voronoi Graph
3D Scene Graph
Room Classification
High-Level Action Space
Grounded High-Level Planning
Scene Structure
Partial Observability
History in Dynamic Scenes
Re-trial and Re-planning
...and 24 more sections

Figures (10)

Figure 1: MoMa-LLM performs long-horizon interactive object search in household environments from language queries using dynamically built scene graphs.
Figure 2: MoMa-LLM: From posed RGB-D images and semantics, we construct a semantic 3D map from which we extract various occupancy maps in the BEV space and construct a navigational Voronoi graph. Through room clustering and room-object assignments we then build up a hierarchical scene graph. From this scalable scene representation, we extract the task-relevant knowledge and encode it into a structured language representation. A large language model then produces high-level commands which are executed by low-level subpolicies. These in turn draw on and update the scene representations.
Figure 3: Room Classification Prompt: based on the objects and room clusters of the scene graph, an LLM performs open-vocabulary classification.
Figure 4: High-level Reasoning Prompt: We encode the extracted scene representation to natural language, providing structured information to a language model.
Figure 5: Interactive search efficiency curve in simulation. Each point depicts the success rate for a given maximum time budget (x-axis).
...and 5 more figures
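
The efficiency curve described for Figure 5 reports, for each maximum time budget, the fraction of episodes solved within that budget. A minimal sketch of how such a curve and an area-under-curve summary could be computed is given below. The function names, the use of integer step counts as the budget unit, and the trapezoidal normalization are assumptions made for illustration; the exact AUC-E definition is given in the paper's evaluation section and may differ.

```python
from typing import List, Optional, Sequence


def efficiency_curve(steps_to_success: Sequence[Optional[int]],
                     max_budget: int) -> List[float]:
    """Success rate for every budget from 0 to max_budget (inclusive).

    steps_to_success holds, per episode, the number of steps needed to
    find the target, or None if the episode never succeeded.
    (Hypothetical input format, assumed for this sketch.)
    """
    if not steps_to_success:
        raise ValueError("need at least one episode")
    n = len(steps_to_success)
    curve = []
    for budget in range(max_budget + 1):
        # An episode counts as solved if it succeeded within the budget.
        solved = sum(1 for s in steps_to_success
                     if s is not None and s <= budget)
        curve.append(solved / n)
    return curve


def normalized_auc(curve: Sequence[float]) -> float:
    """Trapezoidal area under the curve, divided by the budget range.

    The result lies in [0, 1]; 1 would mean every episode succeeds
    at budget 0. This normalization is an assumption of the sketch.
    """
    if len(curve) < 2:
        raise ValueError("need at least two budget points")
    area = sum((a + b) / 2 for a, b in zip(curve, curve[1:]))
    return area / (len(curve) - 1)


# Four toy episodes: three succeed after 2, 4, and 1 steps, one fails.
episodes = [2, 4, None, 1]
curve = efficiency_curve(episodes, max_budget=4)
auc = normalized_auc(curve)
```

Summarizing the whole curve rather than a single success rate rewards methods that find objects early, not only those that succeed eventually under a generous budget.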
