SPIRE: Synergistic Planning, Imitation, and Reinforcement Learning for Long-Horizon Manipulation

Zihan Zhou, Animesh Garg, Dieter Fox, Caelan Garrett, Ajay Mandlekar

TL;DR

SPIRE is a system that first uses Task and Motion Planning (TAMP) to decompose tasks into smaller learning subproblems, then combines imitation and reinforcement learning to maximize their strengths. It outperforms prior approaches that integrate imitation learning, reinforcement learning, and planning.

Abstract

Robot learning has proven to be a general and effective technique for programming manipulators. Imitation learning is able to teach robots solely from human demonstrations but is bottlenecked by the capabilities of the demonstrations. Reinforcement learning uses exploration to discover better behaviors; however, the space of possible improvements can be too large to start from scratch. For both techniques, the learning difficulty increases in proportion to the length of the manipulation task. Accounting for this, we propose SPIRE, a system that first uses Task and Motion Planning (TAMP) to decompose tasks into smaller learning subproblems and then combines imitation and reinforcement learning to maximize their strengths. We develop novel strategies to train learning agents when deployed in the context of a planning system. We evaluate SPIRE on a suite of long-horizon and contact-rich robot manipulation problems. We find that SPIRE outperforms prior approaches that integrate imitation learning, reinforcement learning, and planning by 35% to 50% in average task performance, is 6 times more data efficient in the number of human demonstrations needed to train proficient agents, and learns to complete tasks nearly twice as efficiently. See https://sites.google.com/view/spire-corl-2024 for more details.

Paper Structure

This paper contains 25 sections, 1 theorem, 4 equations, 12 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

Let $\mathsf{P}$ be a SPIRE-compatible planner, i.e., $\mathsf{P}$ satisfies sequence validity, section validity, and goal validity. There exists an agent $\pi_*$ with which SPIRE reaches the goal state deterministically within a finite number of steps if the initial state is admissible by $\mathsf{

Figures (12)

  • Figure 1: SPIRE Overview. (Left) SPIRE first attempts to solve the task with a TAMP system. When the TAMP planner encounters an action deemed too hard to plan, it then enters the handoff section and delegates the action to a human teleoperator to manually complete it. We record the trajectories from the human operators to build a demonstration dataset and train an IL policy with it. Finally, we train an RL policy to fine-tune the IL policy via warmstarting and deviation constraining. (Right) The four handoff sections in Coffee Preparation.
  • Figure 2: SPIRE execution. SPIRE computes a TAMP plan but defers execution of certain contact-rich skills, such as insert and hang, to learned agents; we call these handoff sections. The preconditions of each handoff section define the initial state distribution of the agent, and the postconditions of each action correspond to the termination states of the corresponding MDP for the handoff section.
  • Figure 3: Full evaluation. Comparing the success rates (left) and the average duration (right) of successful rollouts of HITL-TAMP-BC (BC), TAMP-gated Plan-Seq-Learn (RL), and SPIRE (Ours) across all 9 tasks. Each datapoint is taken from the best run out of 5 seeds and is averaged over 50 rollouts. SPIRE improves on the BC policy in both success rate and average duration in all 9 tasks and reaches an 80% success rate in 8 of them. RL has an advantage in average duration on the easier set of tasks but fails to learn anything on the rest.
  • Figure 4: Qualitative comparison. Rollouts of vanilla RL vs. our method. The vanilla RL agent attempts to close the lid by knocking the coffee machine, while our agent follows the demonstrations and closes the lid with its fingers.
  • Figure 5: Demo efficiency and sampling strategy ablation. (Left) Minimum number of demos needed to reach at least an $80\%$ success rate. (Right) Success rates across 5 seeds in Tool Hang, comparing the permissive and sequential strategies. The sequential strategy has lower variance, but the permissive strategy yields the better top-1 policy.
  • ...and 7 more figures
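
The execution scheme described in the Figure 2 caption can be illustrated with a small toy sketch. This is not the authors' implementation: the `Step` class, the `run_plan` function, the dictionary-based symbolic state, and the `planner_exec` and `agent_act` callbacks are all hypothetical names introduced for illustration. The sketch only shows the control flow the caption implies: planner-executed steps run directly, while a handoff section starts from a state satisfying its preconditions and the learned agent acts until the postconditions (the termination states of that section's MDP) hold or a step budget runs out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]  # toy symbolic state (hypothetical, for illustration only)


@dataclass
class Step:
    name: str
    precondition: Callable[[State], bool]   # must hold before the step starts
    postcondition: Callable[[State], bool]  # termination condition of the step
    handoff: bool                           # True: delegated to a learned agent


def run_plan(
    plan: List[Step],
    state: State,
    planner_exec: Callable[[Step, State], State],
    agent_act: Callable[[Step, State], State],
    max_agent_steps: int = 50,
) -> Tuple[bool, State]:
    """Execute a plan, deferring handoff steps to the learned agent."""
    state = dict(state)  # do not mutate the caller's state
    for step in plan:
        if not step.precondition(state):
            return False, state
        if step.handoff:
            # The agent acts until the postcondition holds or the budget ends.
            for _ in range(max_agent_steps):
                if step.postcondition(state):
                    break
                state = agent_act(step, state)
        else:
            # Planner-controlled step, executed in one shot in this toy model.
            state = planner_exec(step, state)
        if not step.postcondition(state):
            return False, state
    return True, state


def planner_exec(step: Step, state: State) -> State:
    # Toy planner action: the grasp always succeeds.
    return {**state, "holding": 1}


def agent_act(step: Step, state: State) -> State:
    # Toy agent action: each call makes one unit of insertion progress.
    return {**state, "progress": state["progress"] + 1}


PLAN = [
    Step("grasp", lambda s: s["holding"] == 0, lambda s: s["holding"] == 1, handoff=False),
    Step("insert", lambda s: s["holding"] == 1, lambda s: s["progress"] >= 3, handoff=True),
]

if __name__ == "__main__":
    ok, final = run_plan(PLAN, {"holding": 0, "progress": 0}, planner_exec, agent_act)
    print(ok, final)
```

In this toy model, a handoff section fails only when the agent exhausts its step budget before reaching a postcondition state, which mirrors treating the postconditions as the terminal states of the section's MDP.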

Theorems & Definitions (1)

  • Theorem 1