$\texttt{SPIN}$: distilling $\texttt{Skill-RRT}$ for long-horizon prehensile and non-prehensile manipulation

Haewon Jung, Donguk Lee, Haecheol Park, JunHyeop Kim, Beomjoon Kim

TL;DR

SPIN addresses long-horizon prehensile and non-prehensile (PNP) manipulation by distilling a planner, Skill-RRT, into a fast, reactive policy through imitation learning. It introduces connectors to bridge the state gaps between separately trained skills and uses Lazy Skill-RRT to efficiently generate the training problems for those connectors. High-quality plans are then distilled into a diffusion policy, trained on planner trajectories replayed with noise to capture multimodality. The approach transfers zero-shot from simulation to the real world, with simulated success rates of approximately 95%, 93%, and 98% on Card Flip, Bookshelf, and Kitchen, and real-world success of 17/20, 18/20, and 16/20, while maintaining practical inference times. Combining planning-derived skill chaining, learned connectors, and diffusion-based imitation offers a data-efficient path to robust, real-time manipulation in contact-rich, long-horizon tasks.

Abstract

Current robots struggle with long-horizon manipulation tasks requiring sequences of prehensile and non-prehensile skills, contact-rich interactions, and long-term reasoning. We present $\texttt{SPIN}$ ($\textbf{S}$kill $\textbf{P}$lanning to $\textbf{IN}$ference), a framework that distills a computationally intensive planning algorithm into a policy via imitation learning. We propose $\texttt{Skill-RRT}$, an extension of RRT that incorporates skill applicability checks and intermediate object pose sampling for solving such long-horizon problems. To chain independently trained skills, we introduce $\textit{connectors}$, goal-conditioned policies trained to minimize object disturbance during transitions. High-quality demonstrations are generated with $\texttt{Skill-RRT}$ and distilled through noise-based replay in order to reduce online computation time. The resulting policy, trained entirely in simulation, transfers zero-shot to the real world, achieves over 80% success across three challenging long-horizon manipulation tasks, and outperforms state-of-the-art hierarchical RL and planning methods.
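To make the idea of an RRT with skill applicability checks and intermediate object pose sampling more concrete, here is a minimal toy sketch. It is not the paper's $\texttt{Skill-RRT}$: the 1-D "card on a table" world, the two skills, the `EDGE` threshold, the goal bias, and every function name are assumptions made up for illustration. Each iteration samples a skill and a desired object pose, keeps only the tree nodes where that skill is applicable, and extends the nearest of them.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Toy object pose: (position along a 1-D table, flipped?). Purely illustrative.
Pose = Tuple[float, bool]
TABLE = (0.0, 10.0)  # hypothetical table extent
EDGE = 9.0           # hypothetical: flipping is only possible near the table edge


@dataclass
class Skill:
    name: str
    applicable: Callable[[Pose, Pose], bool]  # (current pose, desired pose) -> bool
    apply: Callable[[Pose, Pose], Pose]       # idealized outcome of running the skill


@dataclass
class Node:
    pose: Pose
    parent: Optional["Node"] = None
    action: Optional[Tuple[str, Pose]] = None  # (skill name, desired object pose)


SKILLS = [
    # Sliding changes the position but cannot change which face is up.
    Skill("slide", lambda s, q: s[1] == q[1], lambda s, q: q),
    # Flipping needs the card near the edge and keeps its position.
    Skill("flip", lambda s, q: s[0] >= EDGE and s[1] != q[1],
          lambda s, q: (s[0], q[1])),
]


def skill_rrt(start: Pose, goal: Pose, iters: int = 2000, tol: float = 0.25,
              goal_bias: float = 0.2, seed: int = 0) -> Optional[List[Tuple[str, Pose]]]:
    rng = random.Random(seed)
    tree = [Node(start)]
    for _ in range(iters):
        skill = rng.choice(SKILLS)
        # Sample an intermediate desired object pose (occasionally the goal itself).
        if rng.random() < goal_bias:
            target = goal
        else:
            target = (rng.uniform(*TABLE), rng.random() < 0.5)
        # Applicability check: only nodes where this skill can run are candidates.
        candidates = [n for n in tree if skill.applicable(n.pose, target)]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda n: abs(n.pose[0] - target[0]))
        child = Node(skill.apply(nearest.pose, target), nearest, (skill.name, target))
        tree.append(child)
        if child.pose[1] == goal[1] and abs(child.pose[0] - goal[0]) <= tol:
            plan, node = [], child
            while node.parent is not None:  # walk back to the root
                plan.append(node.action)
                node = node.parent
            return plan[::-1]
    return None


if __name__ == "__main__":
    # Card starts face-up at x=2 and must end face-down near x=3.
    print(skill_rrt(start=(2.0, False), goal=(3.0, True)))
```

In this toy world any successful plan must slide the card to the edge before flipping it, which loosely mirrors the card-flip task. The real planner would additionally need learned skills, collision-aware state, and connectors between skills, none of which is modeled here.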

Paper Structure

This paper contains 38 sections, 1 equation, 11 figures, 23 tables, 8 algorithms.

Figures (11)

  • Figure 1: Overview of our tasks. (a) Objects and problems. A problem is defined by initial and goal object poses, marked in yellow and green. (b) Solutions for these problems. (First row) The robot must flip a thin card, initially in an ungraspable pose, by first moving it to the end of the table to make space for grasping, flipping it, and finally sliding it to the target pose. (Second row) The robot must put the book in the lower shelf, where the robot gripper cannot fit, by first toppling the book to enable grasping, picking it up and placing it at the end of the lower shelf, and then pushing it inside. (Third row) The robot must turn the cup upright in a sink, grab it, place it on the cupboard, and re-orient it so that the handle is inside the desired region. The red arrow indicates the movement of the object, and the orange arrow represents the movement of the robot.
  • Figure 2: Overview of SPIN. (a) Examples of pre-trained PNP skills in the bookshelf domain (Figure 1, row 2): pick-and-place, toppling, and pushing. (b) We first use Lazy Skill-RRT to collect problems for training connectors. The left of (b) shows the RRT tree, where the initial and goal states are marked in red and green respectively. Each edge is defined by a skill execution, and the dotted edge, together with its start and end vertices, denoted $v$ and $v_{\text{connect}}$ respectively, defines a state gap that a connector needs to fill. The middle of (b) shows the state gap. In $v.s$, the robot has just finished a prehensile skill, with the object still between the gripper fingers. In $v_{\text{connect}}.s$, the robot is about to begin a pushing skill to push the book into the shelf. A connector has to fill this state gap. We collect a set of such problems and use RL to train connectors. (c) Skill-RRT is run with the trained connectors $\mathcal{C}$ and the skill library $\mathcal{K}$ to generate a skill plan $\tau_{\text{skill}}$. A skill plan consists of a sequence of skills, connectors, and their associated desired object poses or desired robot configurations. (d) We use IL to distill skill plans into a single policy. To filter data, we replay each skill plan $N$ times, and plans with a replay success rate below a predefined threshold $m$ are filtered out. The remaining high-quality trajectories are used to train a diffusion policy, which is deployed zero-shot in the real world.
  • Figure 3: Regions for each domain.
  • Figure 4: The real-world setup for each domain. Blue polygons mark the target objects, and red circles indicate the camera locations.
  • Figure 5: Illustration of motion planner failure cases.
  • ...and 6 more figures
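
The data-filtering step described in the Figure 2(d) caption (replay each skill plan $N$ times and discard plans whose replay success rate falls below a threshold $m$) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function name, the `replay` callback, and the stubbed outcomes in the usage example are hypothetical, and a real `replay` would roll the plan out in a simulator, presumably with noise.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

Plan = TypeVar("Plan")


def filter_plans_by_replay(
    plans: Sequence[Plan],
    replay: Callable[[Plan], bool],  # hypothetical: one replay, True on success
    n_replays: int,                  # N in the caption
    min_success_rate: float,         # m in the caption
) -> List[Plan]:
    """Keep only plans whose replay success rate is at least min_success_rate."""
    if n_replays <= 0:
        raise ValueError("n_replays must be positive")
    kept: List[Plan] = []
    for plan in plans:
        successes = sum(1 for _ in range(n_replays) if replay(plan))
        if successes / n_replays >= min_success_rate:
            kept.append(plan)
    return kept


if __name__ == "__main__":
    # Stubbed replay outcomes standing in for simulator rollouts (made-up data).
    outcomes: Dict[str, List[bool]] = {
        "plan_a": [True] * 10,
        "plan_b": [True] * 5 + [False] * 5,
    }
    counters = {name: 0 for name in outcomes}

    def stub_replay(name: str) -> bool:
        result = outcomes[name][counters[name]]
        counters[name] += 1
        return result

    print(filter_plans_by_replay(list(outcomes), stub_replay, 10, 0.9))
```

Filtering on replay success rather than on a single successful execution favors plans that tolerate perturbations, which is the property the caption attributes to the "high-quality" trajectories used for training.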