Toward Accurate Long-Horizon Robotic Manipulation: Language-to-Action with Foundation Models via Scene Graphs
Sushil Samuel Dinesh, Shinkyu Park
TL;DR
This work addresses long-horizon robotic manipulation without task-specific training by combining off-the-shelf foundation models in a layered, ROS2-based architecture. A persistent scene graph serves as a shared world model and is updated online through LLM–VLM dialogue. A dedicated cognitive layer handles multi-step planning, while a fast execution layer handles motion primitives and control. Across fundamental, structured, and advanced reasoning tasks, the framework demonstrates high scene-graph consistency and robust execution, with performance limited primarily by linguistic ambiguity and cluttered scenes. The approach offers a practical path toward scalable, semantically aware robotic manipulation that relies on existing foundation models rather than task-specific data or fine-tuning.
Abstract
This paper presents a framework that leverages pre-trained foundation models for robotic manipulation without domain-specific training. It combines multimodal perception from off-the-shelf foundation models with a general-purpose reasoning model capable of robust task sequencing. Scene graphs, maintained dynamically within the framework, provide spatial awareness and enable consistent reasoning about the environment. The framework is evaluated in a series of tabletop robotic manipulation experiments, and the results highlight its potential for building robotic manipulation systems directly on top of off-the-shelf foundation models.
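
To make the idea of a dynamically maintained scene graph concrete, the following is a minimal Python sketch. It is not the authors' implementation: the class name SceneGraph, its methods, the object names, and the "on" predicate are all hypothetical, chosen only to illustrate how objects and spatial relations might be stored and kept consistent after a manipulation step.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical): objects are nodes, spatial relations are labeled edges."""
    objects: dict = field(default_factory=dict)   # object name -> attribute dict
    relations: set = field(default_factory=set)   # (subject, predicate, target) triples

    def add_object(self, name, **attributes):
        """Register an object, e.g. one reported by a perception model."""
        self.objects[name] = attributes

    def _check(self, *names):
        # Reject relations that mention objects the graph does not know about.
        for name in names:
            if name not in self.objects:
                raise KeyError(f"unknown object: {name}")

    def relate(self, subject, predicate, target):
        """Add a spatial relation such as ("red_block", "on", "table")."""
        self._check(subject, target)
        self.relations.add((subject, predicate, target))

    def move(self, subject, predicate, target):
        """Replace the subject's old relation of this kind with a new one,
        as would be needed after a pick-and-place action."""
        self._check(subject, target)
        self.relations = {
            r for r in self.relations
            if not (r[0] == subject and r[1] == predicate)
        }
        self.relations.add((subject, predicate, target))

    def query(self, predicate, target):
        """Return, in sorted order, every subject that stands in `predicate` to `target`."""
        return sorted(s for (s, p, t) in self.relations if p == predicate and t == target)


# Usage: a tabletop scene before and after stacking one block on another.
graph = SceneGraph()
graph.add_object("table")
graph.add_object("red_block", color="red")
graph.add_object("blue_block", color="blue")
graph.relate("red_block", "on", "table")
graph.relate("blue_block", "on", "table")

graph.move("red_block", "on", "blue_block")   # simulated result of a place action
print(graph.query("on", "table"))             # ['blue_block']
print(graph.query("on", "blue_block"))        # ['red_block']
```

In this sketch, a planner could query the graph before each step (for example, to check what currently rests on the table) and an execution component could call move once an action completes, so that both read and write the same world state.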
