GraSP-VLA: Graph-based Symbolic Action Representation for Long-Horizon Planning with VLA Policies

Maëlic Neau, Zoe Falomir, Paulo E. Santos, Anne-Gwenn Bosser, Cédric Buche

TL;DR

GraSP-VLA addresses long-horizon robotic planning from demonstrations by grounding symbolic planning in a Continuous Scene Graph built from a four-layer Scene Graph. It generates PDDL actions automatically from observed transitions and schedules a bank of Vision-Language Action (VLA) policies through a synchronized orchestrator, enabling online task decomposition without extensive domain priors. The approach is validated on indoor Scene Graph Generation (SGG) benchmarks, DAHLIA-based planning domain generation, and real-world SO-101 experiments. Decomposition improves long-horizon execution, although SGG accuracy remains a bottleneck. Overall, the work advances neuro-symbolic planning by tightly coupling persistent, perception-grounded relations with modular policy execution for scalable, open-ended imitation learning.

Abstract

Deploying autonomous robots that can learn new skills from demonstrations is an important challenge in modern robotics. Existing solutions typically apply either end-to-end imitation learning with Vision-Language Action (VLA) models or symbolic approaches based on Action Model Learning (AML). On the one hand, current VLA models lack high-level symbolic planning, which limits their performance on long-horizon tasks. On the other hand, symbolic AML approaches lack generalization and scalability. In this paper we present GraSP-VLA, a neuro-symbolic framework that uses a Continuous Scene Graph representation to generate a symbolic representation of human demonstrations. This representation is used to generate new planning domains during inference and serves as an orchestrator for low-level VLA policies, increasing the number of actions that can be executed in sequence. Our results show that GraSP-VLA is effective at modeling symbolic representations on the task of automatic planning domain generation from observations. In addition, real-world experiments show the potential of our Continuous Scene Graph representation to orchestrate low-level VLA policies in long-horizon tasks.

Paper Structure

This paper contains 13 sections, 5 equations, 4 figures, 5 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Overall architecture of GraSP-VLA. Top: automatic PDDL action extraction from a single demonstration using Continuous Scene Graph Generation. Bottom: task execution using a bank of pre-trained VLA policies.
  • Figure 2: Example of state refinement for a relation between two nodes at a given layer. States are represented by the label of the relation, for instance, $8 = \text{above}$ and $5 = \text{below}$. The sliding window is set to 3 timestamps (i.e., $\theta = 3$).
  • Figure 3: Example of a transition identified using the interactions of the Topological and Functional layers of the Continuous Scene Graph for the action "Moving glass to shelf".
  • Figure 4: (a) Initial setup; (b)-(c) possible end configurations.
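
The state refinement described in the Figure 2 caption can be illustrated with a minimal sketch. The caption gives only the window size ($\theta = 3$) and the integer labels (8 = above, 5 = below); the majority-vote rule, the tie-breaking choice, and the function name `refine_states` are assumptions made for illustration, not the paper's stated procedure.

```python
from collections import Counter


def refine_states(labels, theta=3):
    """Smooth a per-timestamp sequence of relation labels.

    Assumption: each refined label is the majority label over the last
    `theta` timestamps. Ties go to the most recent tied label.
    """
    refined = []
    for t in range(len(labels)):
        # Window of at most `theta` timestamps ending at t.
        window = labels[max(0, t - theta + 1): t + 1]
        counts = Counter(window)
        top = max(counts.values())
        # Walk backwards so the most recent label wins a tie.
        for label in reversed(window):
            if counts[label] == top:
                refined.append(label)
                break
    return refined


# 8 = above, 5 = below, as in the Figure 2 caption.
raw = [8, 8, 5, 8, 8, 5, 5, 5]
print(refine_states(raw, theta=3))
```

Under this assumed rule, the isolated `5` at the third timestamp is treated as noise, while the sustained run of `5` at the end is accepted as a genuine state change after a short delay.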

Theorems & Definitions (4)

  • Definition 1: Continuous Scene Graph
  • Definition 2: Updates
  • Definition 3: Relations
  • Definition 4: Action
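
The summary states that PDDL actions are generated automatically from observed transitions, and Figure 3 names one such transition ("Moving glass to shelf"). The sketch below shows one simple way such a conversion could work: it diffs the relation sets before and after the transition. The relation name `on`, the object names, and the helper names `fmt` and `transition_to_pddl` are hypothetical, and using the full before-state as the precondition is a simplifying assumption rather than the paper's definition of an action.

```python
def fmt(rel):
    """Render a relation tuple such as ("on", "glass", "table") as a PDDL atom."""
    pred, *args = rel
    return "(" + " ".join([pred] + [f"?{a}" for a in args]) + ")"


def transition_to_pddl(name, before, after):
    """Build a PDDL action string from two sets of relation tuples.

    Assumptions: every relation in `before` becomes a precondition,
    added relations become positive effects, removed relations become
    negative effects, and object names are reused as parameter names.
    """
    params = sorted({a for rel in before | after for a in rel[1:]})
    added = sorted(after - before)
    removed = sorted(before - after)
    pre = " ".join(fmt(r) for r in sorted(before))
    eff = " ".join([fmt(r) for r in added] + [f"(not {fmt(r)})" for r in removed])
    return (
        f"(:action {name}\n"
        f"  :parameters ({' '.join('?' + p for p in params)})\n"
        f"  :precondition (and {pre})\n"
        f"  :effect (and {eff}))"
    )


before = {("on", "glass", "table")}
after = {("on", "glass", "shelf")}
print(transition_to_pddl("move-glass-to-shelf", before, after))
```

A real pipeline would also need typed parameters and a way to drop relations that are irrelevant to the action, neither of which this sketch attempts.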