SymSkill: Symbol and Skill Co-Invention for Data-Efficient and Real-Time Long-Horizon Manipulation

Yifei Simon Shao, Yuchen Zheng, Sunan Sun, Pratik Chaudhari, Vijay Kumar, Nadia Figueroa

TL;DR

SymSkill targets data-efficient, real-time long-horizon manipulation by co-inventing symbols and skills from unlabeled, unsegmented demonstrations and executing them with a fast symbolic planner and SE(3) dynamical system (DS) policies. It grounds predicates in relative frames selected with a vision-language model, derives operators from repeating predicate transitions, and learns a DS-based skill per operator, enabling real-time recovery through replanning and resampling. In RoboCasa simulation it reaches an 85% success rate on 12 single-step tasks and composes them into multi-step plans without additional data. On a real robot it learns from about 5 minutes of play data, and with a compliant controller it executes safely and without interruption under disturbances. The source code is available on the project site.

Abstract

Multi-step manipulation in dynamic environments remains challenging. Two major families of methods fail in distinct ways: (i) imitation learning (IL) is reactive but lacks compositional generalization, as monolithic policies do not decide which skill to reuse when scenes change; (ii) classical task-and-motion planning (TAMP) offers compositionality but has prohibitive planning latency, preventing real-time failure recovery. We introduce SymSkill, a unified learning framework that combines the benefits of IL and TAMP, allowing compositional generalization and failure recovery in real time. Offline, SymSkill jointly learns predicates, operators, and skills directly from unlabeled and unsegmented demonstrations. At execution time, given a goal specified as a conjunction of one or more learned predicates, SymSkill uses a symbolic planner to compose and reorder learned skills to achieve the symbolic goal, while performing recovery at both the motion and symbolic levels in real time. Coupled with a compliant controller, SymSkill enables safe and uninterrupted execution under human and environmental disturbances. In RoboCasa simulation, SymSkill executes 12 single-step tasks with an 85% success rate. Without additional data, it composes these skills into multi-step plans requiring up to 6 skill recompositions, recovering robustly from execution failures. On a real Franka robot, we demonstrate that SymSkill, learning from 5 minutes of unsegmented and unlabeled play data, can perform multiple tasks from goal specifications alone. The source code and additional analysis can be found at https://sites.google.com/view/symskill.
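To make the execution-time step concrete, the following is a minimal sketch of planning toward a conjunction of predicates and replanning after a disturbance. It is not the SymSkill planner: the abstract does not say which symbolic planner is used, so plain breadth-first search over STRIPS-style operators stands in for it, and every predicate and operator name below (`door_open`, `ReachHandle`, and so on) is hypothetical.

```python
from collections import deque
from typing import FrozenSet, Iterable, List, NamedTuple, Optional


class Operator(NamedTuple):
    """STRIPS-style operator; in SymSkill each operator would own one learned skill."""
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    delete_effects: FrozenSet[str]


def plan(state: Iterable[str], goal: Iterable[str],
         operators: List[Operator]) -> Optional[List[str]]:
    """Breadth-first search for a shortest operator sequence reaching the goal."""
    start, goal_set = frozenset(state), frozenset(goal)
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        current, path = frontier.popleft()
        if goal_set <= current:  # every goal predicate holds
            return path
        for op in operators:
            if op.preconditions <= current:
                nxt = (current - op.delete_effects) | op.add_effects
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, path + [op.name]))
    return None  # goal unreachable with these operators


def fs(*items: str) -> FrozenSet[str]:
    return frozenset(items)


# Hypothetical operators for a "put item in cabinet" task.
OPERATORS = [
    Operator("ReachHandle", fs(), fs("gripper_at_handle"), fs("gripper_at_item")),
    Operator("OpenDoor", fs("gripper_at_handle"), fs("door_open"), fs("door_closed")),
    Operator("ReachItem", fs(), fs("gripper_at_item"), fs("gripper_at_handle")),
    Operator("PlaceInCabinet", fs("gripper_at_item", "door_open"),
             fs("item_in_cabinet"), fs("gripper_at_item")),
]

GOAL = {"item_in_cabinet"}

# Initial plan from a closed door.
initial_plan = plan({"door_closed"}, GOAL, OPERATORS)

# Symbolic-level recovery: the door is already open but the gripper was
# pushed away from the item, so planning restarts from the observed state.
recovery_plan = plan({"door_open"}, GOAL, OPERATORS)

print(initial_plan)
print(recovery_plan)
```

Because the planner only sees the currently observed predicates, recovery is the same call as initial planning with a different start state, which is what keeps replanning cheap enough to run inside a control loop in this toy setting.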

Paper Structure

This paper contains 24 sections, 22 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of the SymSkill predicate and skill co-invention process on a DoorOpen task. Left: In the premotion segment (end-effector-only motion), the object in motion in the next segment is treated as the object of interest ${o_\textrm{int}}$, and its frame serves as the reference for both predicate and skill learning. End-effector trajectories in this frame are used to fit SE(3) LPV-DS skills, and their endpoints are clustered to yield object-gripper relative pose predicates ${}^{o_\textrm{int}}\psi_{ee}$. Right: In the motion segment (gripper + object moving), a reference object ${o_\textrm{ref}}$ is selected by querying a VLM on frames from the segment. Gripper trajectories are then expressed in the ${o_\textrm{ref}}$ frame and used to fit a DS skill. Endpoints of the manipulated object trajectory in the ${o_\textrm{ref}}$ frame are clustered to yield object–object relative pose predicates ${}^{o_\textrm{ref}}\psi_{o_\textrm{int}}$.
  • Figure 2: SymSkill offline pipeline (top half) and online pipeline (bottom half). Subsection \ref{subsec: demo_seg_ref} (purple) describes segmentation and reference frame selection. Subsection \ref{sec:symbol-learning} (orange) describes how predicates are learned for each segment. Subsection \ref{sec: operator_learning} (green) describes how the operators for online planning are learned. Subsection \ref{sec: skill_learning} (blue) describes how each operator's skill is learned. Subsection \ref{sec: online} (yellow) describes how SymSkill operates online.
  • Figure 3: The VLM prompting procedure used for the real-world learning-from-play experiment. First, the initial image is used to obtain text descriptions of all objects in view. Next, four equally spaced images from each motion segment are provided to Gemini together with the required output enumeration object, using the structured output feature. The returned text is then mapped back to the corresponding object name.
  • Figure 4: Visualization of demonstrations and the SE(3) LPV-DS policy rollout for Op3 in Tab. \ref{tb:nsrt}. The left figure shows multiple collected trajectories placing a thing type item from various locations into the pan. The multimodal nature of the data is captured by 4 distinct Gaussians shown in different colors following the policy learning outlined in Sec. \ref{sec: prelim-skill}. The right figure shows the reconstruction results of the learned policy starting from the same initial conditions, where the policy pose attractor in the pan frame is marked as an axis. All demonstrations converge on the attractor.
  • Figure 5: Real-world data collection pipeline. We use a motion capture system to record object interactions in the workspace. Here we show one motion episode with a sequence of timestamped images; the manipulated object (${o_\textrm{int}}$) is a banana. Frames with orange, yellow, and green banners denote the premotion, motion, and post-motion segments, respectively.
  • ...and 2 more figures
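
The captions for Figures 1 and 4 describe expressing trajectories in a reference object's frame and rolling out an LPV-DS policy that converges on an attractor in that frame. The sketch below illustrates the idea in a deliberately simplified, position-only form: the skills in the paper act on SE(3), whereas this example ignores orientation. It uses the standard LPV-DS velocity law, $\dot{x} = \sum_k \gamma_k(x) A_k (x - x^*)$, with $\gamma_k$ the Gaussian mixture responsibilities. All numerical values (pan pose, Gaussians, $A_k$ matrices, attractor) are made up for illustration and are not learned from data.

```python
import numpy as np


def to_frame(T_world_frame: np.ndarray, p_world: np.ndarray) -> np.ndarray:
    """Express a world-frame point in a frame given by a 4x4 homogeneous transform."""
    return (np.linalg.inv(T_world_frame) @ np.append(p_world, 1.0))[:3]


def responsibilities(x, means, covs, priors):
    """Posterior weight of each Gaussian at x, computed in log space for stability."""
    log_w = []
    for mu, cov, prior in zip(means, covs, priors):
        d = x - mu
        maha = d @ np.linalg.solve(cov, d)
        _, logdet = np.linalg.slogdet(cov)
        log_w.append(np.log(prior) - 0.5 * (maha + logdet))
    w = np.exp(np.array(log_w) - max(log_w))
    return w / w.sum()


def rollout(x0, attractor, means, covs, priors, A_list, dt=0.01, steps=2000):
    """Forward-Euler integration of x_dot = sum_k gamma_k(x) A_k (x - x*)."""
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(steps):
        gamma = responsibilities(x, means, covs, priors)
        x_dot = sum(g * (A @ (x - attractor)) for g, A in zip(gamma, A_list))
        x = x + dt * x_dot
        path.append(x.copy())
    return np.array(path)


# Hypothetical pan pose in the world: rotated 90 degrees about z and translated.
T_world_pan = np.array([
    [0.0, -1.0, 0.0, 0.5],
    [1.0,  0.0, 0.0, 0.2],
    [0.0,  0.0, 1.0, 0.1],
    [0.0,  0.0, 0.0, 1.0],
])

# Starting end-effector position, converted from the world frame to the pan frame.
start_in_pan = to_frame(T_world_pan, np.array([0.5, 0.7, 0.1]))

# Two illustrative Gaussians and linear systems. Each A_k has a negative
# definite symmetric part, which makes the attractor globally stable.
means = [np.array([0.4, 0.0, 0.0]), np.array([0.1, 0.0, 0.05])]
covs = [0.05 * np.eye(3), 0.05 * np.eye(3)]
priors = [0.5, 0.5]
A_list = [
    np.array([[-1.0, 0.5, 0.0], [-0.5, -1.0, 0.0], [0.0, 0.0, -1.0]]),
    -2.0 * np.eye(3),
]
attractor = np.array([0.0, 0.0, 0.05])  # target position in the pan frame

path = rollout(start_in_pan, attractor, means, covs, priors, A_list)
print(np.linalg.norm(path[-1] - attractor))
```

Because the policy is defined in the pan frame, moving the pan only changes `T_world_pan`; the same Gaussians, matrices, and attractor are reused. A perturbed state simply becomes a new starting point for the same vector field, which is one way to read the disturbance rejection described in the abstract.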