Language-Grounded Dynamic Scene Graphs for Interactive Object Search with Mobile Manipulation
Daniel Honerkamp, Martin Büchner, Fabien Despinoy, Tim Welschehold, Abhinav Valada
TL;DR
This work addresses autonomous long-horizon reasoning for mobile manipulation in large, unexplored environments by grounding large language models (LLMs) in dynamically updated, open-vocabulary scene graphs. The MoMa-LLM framework combines a hierarchical 3D scene graph with a navigational Voronoi graph, dynamic RGB-D mapping, and structured language prompts to produce high-level actions that low-level policies execute, enabling zero-shot reasoning across navigation and manipulation tasks. Key contributions include a scalable dynamic scene representation, compact knowledge extraction for LLM grounding, a semantic interactive search task evaluated with a full efficiency curve and the AUC-E metric, and transfer to a real-world apartment. The approach improves search efficiency over conventional baselines and state-of-the-art approaches in simulation and real-world experiments, and it also applies to more abstract tasks.
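To make the idea of an efficiency curve and an area-under-curve summary concrete, the following is a minimal sketch, not the authors' implementation. It assumes the curve reports the fraction of episodes solved within each step budget and that AUC-E is the normalized area under that curve; the function names, the step-based budget, and the trapezoidal integration are all illustrative assumptions, and the paper's own definition takes precedence.

```python
from typing import List, Optional, Sequence


def efficiency_curve(steps_to_success: Sequence[Optional[int]],
                     max_budget: int) -> List[float]:
    """Success rate for every step budget from 0 to max_budget (inclusive).

    steps_to_success holds, per episode, the number of steps needed to
    succeed, or None if the episode never succeeded (assumed encoding).
    """
    if not steps_to_success:
        raise ValueError("need at least one episode")
    if max_budget <= 0:
        raise ValueError("max_budget must be positive")
    n = len(steps_to_success)
    curve = []
    for budget in range(max_budget + 1):
        solved = sum(1 for s in steps_to_success
                     if s is not None and s <= budget)
        curve.append(solved / n)
    return curve


def auc_e(steps_to_success: Sequence[Optional[int]], max_budget: int) -> float:
    """Normalized area under the efficiency curve, in [0, 1].

    Uses trapezoidal integration with unit spacing between budgets
    (an assumption made for this sketch).
    """
    curve = efficiency_curve(steps_to_success, max_budget)
    area = sum((curve[i] + curve[i + 1]) / 2 for i in range(max_budget))
    return area / max_budget


if __name__ == "__main__":
    # Three hypothetical episodes: solved in 2 steps, in 4 steps, and unsolved.
    episodes = [2, 4, None]
    print(efficiency_curve(episodes, max_budget=4))
    print(auc_e(episodes, max_budget=4))
```

Under these assumptions, a method that solves episodes earlier raises the curve at small budgets and therefore scores a higher area, whereas a single success rate at a fixed budget would not distinguish fast from slow successes.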
Abstract
To fully leverage the capabilities of mobile manipulation robots, it is imperative that they are able to autonomously execute long-horizon tasks in large unexplored environments. While large language models (LLMs) have shown emergent reasoning skills on arbitrary tasks, existing work primarily concentrates on explored environments, typically focusing on either navigation or manipulation tasks in isolation. In this work, we propose MoMa-LLM, a novel approach that grounds language models within structured representations derived from open-vocabulary scene graphs, dynamically updated as the environment is explored. We tightly interleave these representations with an object-centric action space. Given object detections, the resulting approach is zero-shot, open-vocabulary, and readily extendable to a spectrum of mobile manipulation and household robotic tasks. We demonstrate the effectiveness of MoMa-LLM in a novel semantic interactive search task in large realistic indoor environments. In extensive experiments in both simulation and the real world, we show substantially improved search efficiency compared to conventional baselines and state-of-the-art approaches, and we demonstrate the applicability of MoMa-LLM to more abstract tasks. We make the code publicly available at http://moma-llm.cs.uni-freiburg.de.
