Toward Accurate Long-Horizon Robotic Manipulation: Language-to-Action with Foundation Models via Scene Graphs

Sushil Samuel Dinesh, Shinkyu Park

TL;DR

This work addresses long-horizon robotic manipulation without task-specific training by combining off-the-shelf foundation models within a layered, ROS2-based architecture. A persistent scene graph serves as a shared world model: it is updated online through LLM–VLM dialogue and used by a dedicated cognitive layer for multi-step planning, while a fast execution layer handles motion primitives and control. Across fundamental, structured, and advanced reasoning tasks, the framework demonstrates high scene-graph consistency and robust execution, with performance limited primarily by linguistic ambiguity and cluttered scenes. The approach offers a practical path toward scalable, semantically aware robotic manipulation that relies on existing foundation models rather than task-specific data or fine-tuning.

Abstract

This paper presents a framework that leverages pre-trained foundation models for robotic manipulation without domain-specific training. The framework integrates off-the-shelf models, combining multimodal perception from foundation models with a general-purpose reasoning model capable of robust task sequencing. Scene graphs, dynamically maintained within the framework, provide spatial awareness and enable consistent reasoning about the environment. The framework is evaluated through a series of tabletop robotic manipulation experiments, and the results highlight its potential for building robotic manipulation systems directly on top of off-the-shelf foundation models.
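
To make the idea of a dynamically maintained scene graph concrete, the following is a minimal sketch of one possible data structure, not the authors' implementation. The class names (`ObjectNode`, `SceneGraph`), the relation labels such as `"on"`, and the coordinate convention are all assumptions introduced for illustration; the summary above does not specify the actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """One object in the scene (hypothetical schema)."""
    name: str
    position: tuple[float, float, float]  # metres, in an assumed robot base frame
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass
class SceneGraph:
    """Objects as nodes, spatial relations as (subject, relation, object) edges."""
    nodes: dict[str, ObjectNode] = field(default_factory=dict)
    edges: set[tuple[str, str, str]] = field(default_factory=set)

    def upsert(self, node: ObjectNode) -> None:
        # Add a newly perceived object or overwrite a stale entry.
        self.nodes[node.name] = node

    def relate(self, subject: str, relation: str, obj: str) -> None:
        # Refuse edges that mention objects the graph does not know about.
        for name in (subject, obj):
            if name not in self.nodes:
                raise KeyError(f"unknown object: {name}")
        self.edges.add((subject, relation, obj))

    def move(self, name: str, new_position: tuple[float, float, float],
             new_relations: list[tuple[str, str]]) -> None:
        # After a pick-and-place, drop every edge touching the moved object,
        # then record its new position and relations.
        self.nodes[name].position = new_position
        self.edges = {e for e in self.edges if name not in (e[0], e[2])}
        for relation, obj in new_relations:
            self.relate(name, relation, obj)

    def related(self, relation: str, obj: str) -> list[str]:
        # All subjects that stand in `relation` to `obj`, in sorted order.
        return sorted(s for s, r, o in self.edges if r == relation and o == obj)


# Usage: two blocks on a table, then block_b is stacked on block_a.
graph = SceneGraph()
graph.upsert(ObjectNode("table", (0.0, 0.0, 0.0)))
graph.upsert(ObjectNode("block_a", (0.1, 0.0, 0.02)))
graph.upsert(ObjectNode("block_b", (0.2, 0.0, 0.02)))
graph.relate("block_a", "on", "table")
graph.relate("block_b", "on", "table")
graph.move("block_b", (0.1, 0.0, 0.06), [("on", "block_a")])
print(graph.related("on", "table"))
print(graph.related("on", "block_a"))
```

Discarding all edges of a moved object before adding new ones is a deliberately conservative choice in this sketch: it keeps the graph from retaining relations that the action has invalidated, at the cost of requiring the perception step to re-establish any that still hold.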

Paper Structure

This paper contains 29 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Overview of the Proposed Framework: The framework is organized into multiple layers, each with distinct capabilities, and is designed to translate high-level natural language commands from the user into an executable sequence of robot actions.
  • Figure 2: (a) Scene Graph Structure. (b) System Architecture. The layers are organized in a bottom-up hierarchy. Execution Layer: Relies on a conventional motion planner and controller to ensure robust and precise object manipulation. Interaction Layer: Utilizes a powerful, non-reasoning model to interpret user instructions and coordinate task execution. Perception Layer: Incorporates a VLM with RGB-D input from a 3D camera to provide spatial understanding, object localization, and semantic scene descriptions. Cognitive Layer: Employs a reasoning model for advanced long-horizon planning and decision-making.
  • Figure 3: Experiments I-A and I-B. I-A: (a)–(b) The orange moves from its initial position to between the apple and yarn; (c) shows the VLM-identified point satisfying the "in-between" condition. I-B: (d)–(e) The lemon shifts toward its correct cluster; (f) shows feasible points obtained from the VLM.
  • Figure 4: Experiments I-C and I-D. I-C: (a)–(b) The highlighted non-edible object is selected as the odd one out. I-D: (c)–(d) The robot picks only the ingredients required for fried noodles.
  • Figure 5: Experiments II-A and II-B. II-A: (a)–(b) The blocks progress from their initial arrangement on the table to a fully stacked structure. II-B: (c)–(f) The robot solves the Tower of Hanoi puzzle, moving discs step by step from one base to another while adhering to the rules of the game.
  • ...and 3 more figures