SparTa: Sparse Graphical Task Models from a Handful of Demonstrations

Adrian Röfer, Nick Heppert, Abhinav Valada

TL;DR

This paper introduces SparTa, an object-centric framework for learning what a manipulation task seeks to achieve by constructing sparse task skeletons from demonstrations. It segments demonstrations into manipulation graphs, generates events from topological changes, and matches objects across demonstrations using pre-trained features to extract a minimal, probabilistic task skeleton. The learned model provides distributions over relative object poses at task transitions and enables planning and execution in new environments, including zero-shot transfer to a real robot. Experiments on HANDSOME and Robocasa, plus real-robot deployment, show robust segmentation and improved model fidelity with additional demonstrations, while also highlighting failure modes in coordinated or ambiguous tasks.

Abstract

Learning long-horizon manipulation tasks efficiently is a central challenge in robot learning from demonstration. Unlike recent endeavors that focus on directly learning the task in the action domain, we focus on inferring what the robot should achieve in the task, rather than how to do so. To this end, we represent evolving scene states using a series of graphical object relationships. We propose a demonstration segmentation and pooling approach that extracts a series of manipulation graphs and estimates distributions over object states across task phases. In contrast to prior graph-based methods that capture only partial interactions or short temporal windows, our approach captures complete object interactions spanning from the onset of control to the end of the manipulation. To improve robustness when learning from multiple demonstrations, we additionally perform object matching using pre-trained visual features. In extensive experiments, we evaluate our method's demonstration segmentation accuracy and the utility of learning from multiple demonstrations for finding a desired minimal task model. Finally, we deploy the fitted models both in simulation and on a real robot, demonstrating that the resulting task representations support reliable execution across environments.
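The idea of generating events from topological changes between manipulation graphs can be illustrated with a minimal sketch. It assumes each manipulation graph is reduced to a set of undirected edges between named objects, and that an event is emitted whenever an edge appears or disappears between consecutive graphs. The `Add` label follows the notation used in the figure captions; the `Remove` label, the function names, and the toy sequence (loosely modeled on a pushing scenario) are illustrative assumptions, not the authors' implementation.

```python
from typing import FrozenSet, List, Set, Tuple

Edge = FrozenSet[str]   # undirected connection between two objects
Graph = Set[Edge]       # a manipulation graph reduced to its edge set
Event = Tuple[int, str, Tuple[str, ...]]  # (graph index, kind, object pair)


def edge(a: str, b: str) -> Edge:
    """Build an undirected edge between objects a and b."""
    return frozenset((a, b))


def events_from_graphs(graphs: List[Graph]) -> List[Event]:
    """Emit one event per edge that appears or disappears between graphs."""
    events: List[Event] = []
    for t in range(1, len(graphs)):
        prev, curr = graphs[t - 1], graphs[t]
        # Edges present now but not before: a new connection was formed.
        for e in sorted(curr - prev, key=sorted):
            events.append((t, "Add", tuple(sorted(e))))
        # Edges present before but not now: a connection was released.
        for e in sorted(prev - curr, key=sorted):
            events.append((t, "Remove", tuple(sorted(e))))
    return events


# Hypothetical sequence: manipulator "m" contacts object "1", object "1"
# then contacts object "2", and finally all connections are released.
graphs = [
    set(),
    {edge("m", "1")},
    {edge("m", "1"), edge("1", "2")},
    set(),
]
events = events_from_graphs(graphs)
for ev in events:
    print(ev)
```

Representing each graph by its edge set keeps the event logic to two set differences per step, which is enough to show how a sparse sequence of events can fall out of a dense sequence of scene states.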

Paper Structure (14 sections, 5 equations, 6 figures)

Figures (6)

  • Figure 1: Our approach extracts sparse task skeletons from demonstrations. Using object trajectories, it builds a series of manipulation graphs and generates events over graph changes. Using pre-trained features, objects are matched across demonstrations, the events of the same objects are grouped, and a task skeleton is extracted. At inference, these events are interpreted as grasping and placement actions, and the target poses are inferred from the extracted distributions.
  • Figure 2: Schematic example of our segmentation approach. The manipulator $m$ moves towards object $1$ and starts pushing it towards object $2$ (frame $2$). When objects $1$ and $2$ touch, $2$ joins the pushing motion (frame $3$). The manipulation is completed in frame $4$, yet the manipulator remains in its location relative to $1, 2$ until frame $5$. Finally, the manipulator parts from the objects (frame $6$). The first three plots illustrate the $MI$ signal for these three objects and the background object $3$. Once objects are moving together, their mutual information rises. When the motion stops, the mutual information also returns to $0$. In these phases of high $MI$, the connection likelihood model $p_e(a, b)$ is formed, which scores two objects being connected based on their distances. Using this model, we identify the time steps in which objects are not being manipulated (area identified with red hatches). Using these areas to form the distribution of objects being at rest, we prune the connections $e_{a,b}$ between objects and manipulator $m$ to exclude all resting frames at the end of the manipulation.
  • Figure 3: Top: Visualization of topological changes which emit events. Blue nodes represent manipulators, orange nodes represent other objects. Bottom: Graphical representation of the $k$-assignment problem ($k$-ap) underlying the re-identification of objects across demonstrations. The colors represent the different features of the objects. The dotted lines indicate possible associations, bold lines show a full $3$-assignment of the given problem.
  • Figure 4: Top Row: Results of segmentation and event generation. We report the normalized Event Detection Accuracy (EDA) and over-/under-segmentation (SA), where $0$ is ideal. Our approach exhibits an average success rate of $85\%$, while showing a tendency to over-segment by $20\%$. Over-segmentation is more pronounced on Robocasa data, which includes many failed or accidental manipulations. Bottom Rows: Results of extracting task models using our proposed approach. We randomly sample $8$ demonstrations for each task and then fit a model from these, starting with $2$ and incrementally increasing to the full set. We sample $100$ such sets per task and report the averaged metrics. Middle: Success of our approach at extracting the correct model steps. We observe that adding more demonstrations improves the accuracy of the fitted model, with saturation occurring quickly. Bottom: For all steps $Add(*, H_W)$ the reference group $H_W$ is reduced to the minimal entropy object. This lowers the model accuracy. In HANDSOME, the effect is limited to a few tasks, but dramatic. In Robocasa, the effect is more spread out and mostly delays the saturation of yield through additional demonstrations.
  • Figure 5: Top: We compare the likelihood of fitted models against a test set of demonstrations, given different numbers of training samples. We normalize by the overall change in likelihood within one experiment. Additional training samples quickly yield diminishing improvements in the likelihood. Bottom: Evaluation of structurally correct, minimized models on Robocasa with magic actions. We note that performance is constant across training samples but differs strongly between tasks (blue lines).
  • ...and 1 more figure
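
The $k$-assignment problem mentioned in the Figure 3 caption, which underlies re-identifying objects across demonstrations, can be sketched for a tiny instance. The example below assumes each object is described by a short feature vector, that every demonstration contains the same number of objects, and that the cost of a group is the summed pairwise squared Euclidean distance of its members' features. It enumerates all assignments by brute force, which is only feasible for very small problems; this is a stand-in for illustration, not the solver or cost used in the paper, and all names are hypothetical.

```python
from itertools import permutations, product
from typing import List, Sequence, Tuple

Feature = Sequence[float]


def sq_dist(u: Feature, v: Feature) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def match_objects(
    demos: List[List[Feature]],
) -> Tuple[float, List[Tuple[int, ...]]]:
    """Brute-force k-assignment over k = len(demos) demonstrations.

    Returns the minimal total cost and one group per object, where each
    group holds one object index per demonstration.
    """
    n = len(demos[0])
    assert all(len(d) == n for d in demos), "equal object counts assumed"
    best_cost = float("inf")
    best_groups: List[Tuple[int, ...]] = []
    # Keep the first demonstration's order fixed; permute all others.
    for perms in product(permutations(range(n)), repeat=len(demos) - 1):
        groups = [(i,) + tuple(p[i] for p in perms) for i in range(n)]
        cost = 0.0
        for g in groups:
            feats = [demos[d][idx] for d, idx in enumerate(g)]
            for a in range(len(feats)):
                for b in range(a + 1, len(feats)):
                    cost += sq_dist(feats[a], feats[b])
        if cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_cost, best_groups


# Three hypothetical demonstrations with three objects each; the same
# objects appear in a different order and with slightly noisy features.
demos = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.9], [1.1, 1.0], [0.9, 0.0]],
    [[1.0, 1.1], [1.0, 0.1], [0.1, 1.0]],
]
cost, groups = match_objects(demos)
print(groups)
```

Each returned tuple lists, per demonstration, the index of the object assigned to that group, so events belonging to the same physical object can then be pooled across demonstrations. The enumeration grows factorially with the number of objects, so larger instances would need a dedicated multi-dimensional assignment method.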