Build on Priors: Vision-Language-Guided Neuro-Symbolic Imitation Learning for Data-Efficient Real-World Robot Manipulation

Pierrick Lorang, Johannes Huemer, Timothy Duggan, Kai Goebel, Patrik Zips, Matthias Scheutz

Abstract

Enabling robots to learn long-horizon manipulation tasks from a handful of demonstrations remains a central challenge in robotics. Existing neuro-symbolic approaches often rely on hand-crafted symbolic abstractions, semantically labeled trajectories, or large demonstration datasets, limiting their scalability and real-world applicability. We present a scalable neuro-symbolic framework that autonomously constructs symbolic planning domains and data-efficient control policies from as few as one to thirty unannotated skill demonstrations, without requiring manual domain engineering. Our method segments demonstrations into skills and employs a Vision-Language Model (VLM) to classify skills and identify equivalent high-level states, enabling automatic construction of a state-transition graph. This graph is processed by an Answer Set Programming solver to synthesize a PDDL planning domain, which an oracle function exploits to isolate the minimal, task-relevant, target-relative observation and action spaces for each skill policy. Policies are learned at the control-reference level rather than at the raw actuator-signal level, yielding a smoother and less noisy learning target. Known controllers can be leveraged for real-world data augmentation by projecting a single demonstration onto other objects in the scene, simultaneously enriching the graph construction process and the dataset for imitation learning. We validate our framework primarily on a real industrial forklift across statistically rigorous manipulation trials, and demonstrate cross-platform generality on a Kinova Gen3 robotic arm across two standard benchmarks. Our results show that combining grounded control learning, VLM-driven abstraction, and automated planning synthesis in a unified pipeline constitutes a practical path toward scalable, data-efficient, expert-free, and interpretable neuro-symbolic robotics.
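A minimal sketch of the abstraction step described above: turning segmented, unannotated demonstrations into the state-transition graph that the ASP solver later compiles into a PDDL domain. The callables `vlm_label` and `states_equivalent`, and the `demo.frame(t)` accessor, are hypothetical stand-ins for the paper's VLM queries and data interface, not its actual API:

```python
# Sketch only: construct the state-transition graph G from segmented
# demonstrations. VLM calls are abstracted behind two callables.
from collections import defaultdict

def build_transition_graph(demos, breakpoints, vlm_label, states_equivalent):
    """demos[i] is one demonstration; breakpoints[i] its change-point indices.

    vlm_label(img_before, img_after) -> skill name from the vocabulary Lambda
    states_equivalent(img_a, img_b)  -> True if both show the same high-level state
    """
    nodes = []                    # one representative image per abstract state
    edges = defaultdict(list)     # node id -> [(skill, successor node id), ...]

    def node_id(image):
        # Merge states the VLM judges equivalent into a single graph node.
        for j, rep in enumerate(nodes):
            if states_equivalent(rep, image):
                return j
        nodes.append(image)
        return len(nodes) - 1

    for demo, cuts in zip(demos, breakpoints):
        for t0, t1 in zip(cuts[:-1], cuts[1:]):   # one skill segment per pair
            u = node_id(demo.frame(t0))
            v = node_id(demo.frame(t1))
            edges[u].append((vlm_label(demo.frame(t0), demo.frame(t1)), v))
    return nodes, edges           # input to the ASP solver that emits PDDL
```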


Paper Structure

This paper contains 73 sections, 21 equations, 14 figures, 1 table, 1 algorithm.

Figures (14)

  • Figure 1: Our forklift system loading two pallets from the ground and unloading them onto a truck. The system builds on priors (perception, a VLM for annotation, and control) to imitate long-horizon, complex task planning and acting from only 10 base skill demonstrations and 1 additional adaptation-skill trajectory.
  • Figure 2: Training Pipeline. Two shared input columns feed three processing lanes, each producing one component of the learned system $\Phi$ (right). Shared input: Raw demonstrations (vision, proprioception, vocabulary $\Lambda$, change-point breakpoints $\hat{T}$) are augmented by projecting waypoint skeletons onto all scene objects ($N \rightarrow N \times K$ trajectories). Augmented data feeds all three lanes. Lane A (object detection): OWLv2 distills a compact YOLOv8 student online. Detections are lifted to 3D and combined with proprioception to form ego-centric trajectories $\tau = \{ p_{\text{obj}} - p_{\text{ee}}, a_t \}$, tracked across time. Delivers the YOLOv8 Detector. Lane B (symbolic abstraction): A VLM classifies each skill segment by comparing before/after scene images against $\Lambda$, building graph $G$ that an ASP solver converts into PDDL domain $\sigma = \langle \mathcal{E}, \mathcal{F}, \mathcal{S}, \mathcal{O} \rangle$. Delivers PDDL Domain $\sigma$, and feeds $\sigma$ back (dashed) to Lane C. Lane C (control learning, sequential after Lane B): The Oracle reads $\sigma$ to construct minimal, ego-centric observations $\tilde{s}^{o_i}_t$ per operator (see the observation-construction sketch after this list). Diffusion sub-policies with Transformer termination predictors $\beta_{i,j}$ are assembled into per-operator skill automata. Delivers Skill Automata $\{\pi_i\}$.
  • Figure 3: Execution Pipeline. At execution time, the system proceeds through five stages. (1) Scene grounding: YOLOv8 detects task-relevant objects, 3D positions are estimated, and relational predicates are evaluated to construct the initial symbolic state $s_0$. The user specifies a goal $s_g$ as a partial state. (2) Symbolic planning: MetricFF solves the PDDL instance $\mathcal{T} = \langle \mathcal{E}, \mathcal{F}, \mathcal{O}, s_0, s_g \rangle$ in milliseconds, producing a human-readable, inspectable plan $\mathcal{P}$. (3) Hierarchical skill execution: For each operator $o_i$, the Oracle $\phi$ resolves symbolic entity bindings to physical detections and constructs the ego-centric observation. Diffusion sub-policies execute closed-loop at $\sim$20 Hz; a Transformer termination predictor $\beta_{i,j}$ gates transitions between action steps (see the execution-loop sketch after this list). The tracker updates the symbolic state after each operator completes. (4) Generalization: Three levels of generalization are handled by distinct mechanisms: relative observations for intra-task generalization, replanning for inter-task generalization, and on-the-fly OWLv2 distillation for cross-object transfer. (5) Termination: Execution terminates when $s_t \models s_g$.
  • Figure 4: Pallet alignment problem with respect to ADAPT kinematics. Relevant states include: rear body pose $\mathbf{q}_R = [x_R, y_R, \theta_R]^\mathrm{T}$, fork tip pose $\mathbf{q}_T = [x_T, y_T, \theta_T, z_T]^\mathrm{T}$, and target pallet pose $\mathbf{q}_P = [x_P, y_P, \theta_P, z_P]^\mathrm{T}$, along with the articulation angle $\gamma$ and control inputs $v$ and $\dot{\gamma}$.
  • Figure 5: Main evaluation domain: an automated forklift for managing multiple pallets in an outdoor scenario. The pallets can be stored on the ground or on a truck platform. Image credit: © AIT/tm-photography.
  • ...and 9 more figures
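Figure 2 (Lane C) states that the Oracle reads the synthesized domain $\sigma$ to build a minimal, ego-centric observation per operator, and Lane A gives the target-relative form $p_{\text{obj}} - p_{\text{ee}}$. Below is a rough sketch of that construction; the dictionary keys and array shapes are illustrative assumptions, not the paper's interface:

```python
# Sketch of the Oracle's ego-centric observation construction (Figure 2,
# Lane C). Names and shapes are illustrative, not the authors' API.
import numpy as np

def oracle_observation(detections, proprio, operator_bindings):
    """Build the minimal, target-relative observation for one operator.

    detections:        {entity_name: 3D position from the YOLOv8 detector}
    proprio:           robot proprioception, incl. end-effector position p_ee
    operator_bindings: symbolic entities the PDDL operator refers to
    """
    p_ee = proprio["p_ee"]
    # Keep only the entities the operator binds, expressed relative to the
    # end effector so the policy is invariant to absolute scene placement.
    rel = [detections[e] - p_ee for e in operator_bindings]
    return np.concatenate(rel + [proprio["joint_state"]])
```

Restricting the observation to operator-bound entities in end-effector-relative coordinates is what the captions credit for intra-task generalization: a policy trained on one object placement transfers to others.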
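And a skeleton of Figure 3's execution loop under the same assumptions, reusing `oracle_observation` from the sketch above; `plan_with_metricff`, `entails`, and the `tracker` interface are placeholders for the planner call, the goal-entailment check, and state tracking:

```python
# Skeleton of the five-stage execution pipeline (Figure 3), sketch only.
def execute(domain, s0, s_goal, skills, tracker):
    plan = plan_with_metricff(domain, s0, s_goal)     # (2) symbolic planning
    for op in plan:                                   # (3) hierarchical execution
        policy, terminated = skills[op.name]          # diffusion policy + beta
        obs = oracle_observation(tracker.detections(), tracker.proprio(),
                                 op.bindings)
        while not terminated(obs):                    # beta gates the transition
            tracker.apply(policy(obs))                # closed-loop, ~20 Hz
            obs = oracle_observation(tracker.detections(), tracker.proprio(),
                                     op.bindings)
        tracker.update_symbolic_state()               # tracker update per operator
    assert entails(tracker.symbolic_state(), s_goal)  # (5) stop when s_t |= s_g
```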