Chain of Thought Imitation with Procedure Cloning

Mengjiao Yang; Dale Schuurmans; Pieter Abbeel; Ofir Nachum

TL;DR

The paper addresses limited generalization in imitation learning by introducing procedure cloning, which learns not only the expert action but also the sequence of intermediate computations (the chain of thought) that produced it. It formalizes procedure observations and proposes two factorization strategies for the joint distribution p(a, x|s), enabling autoregressive sequence modeling of computations with a transformer-like architecture. Across synthetic maze navigation, AntMaze navigation, image-based manipulation, and MinAtar game playing, procedure cloning generalizes better to unseen configurations than behavioral cloning and auxiliary-task approaches. The approach uses the richer supervision in expert procedures to bridge planning and decision making, offering a scalable route to more robust imitation-learning policies across domains.
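
As a sketch of what the two factorizations look like, write $\mathbf{x} = (x_0, \dots, x_L)$ for the procedure observations. The first line below is just the chain rule applied to $p(a, \mathbf{x} \mid s)$ and holds for any distribution. The second line is the Markov-style form quoted in the Figure 3 caption; it equals the first only when each $x_\ell$ depends on the past solely through $x_{\ell-1}$ and the action depends solely on $x_L$. Identifying it with the "conditionally independent" variant is an assumption here, not something this summary states.

```latex
\begin{align}
p(a, \mathbf{x} \mid s)
  &= p(a \mid \mathbf{x}, s)\,
     \prod_{\ell=1}^{L} p(x_\ell \mid x_{<\ell}, s)\;
     p(x_0 \mid s)
  && \text{(chain rule, autoregressive)} \\
p(a, \mathbf{x} \mid s)
  &= p(a \mid x_L)\,
     \prod_{\ell=1}^{L} p(x_\ell \mid x_{\ell-1})\;
     p(x_0 \mid s)
  && \text{(under the Markov assumption)}
\end{align}
```

Behavioral cloning corresponds to modeling only the marginal $p(a \mid s)$ and discarding $\mathbf{x}$.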

Abstract

Imitation learning aims to extract high-performance policies from logged demonstrations of expert behavior. It is common to frame imitation learning as a supervised learning problem in which one fits a function approximator to the input-output mapping exhibited by the logged demonstrations (input observations to output actions). While the framing of imitation learning as a supervised input-output learning problem allows for applicability in a wide variety of settings, it is also an overly simplistic view of the problem in situations where the expert demonstrations provide much richer insight into expert behavior. For example, applications such as path navigation, robot manipulation, and strategy games acquire expert demonstrations via planning, search, or some other multi-step algorithm, revealing not just the output action to be imitated but also the procedure for how to determine this action. While these intermediate computations may use tools not available to the agent during inference (e.g., environment simulators), they are nevertheless informative as a way to explain an expert's mapping of state to actions. To properly leverage expert procedure information without relying on the privileged tools the expert may have used to perform the procedure, we propose procedure cloning, which applies supervised sequence prediction to imitate the series of expert computations. This way, procedure cloning learns not only what to do (i.e., the output action), but how and why to do it (i.e., the procedure). Through empirical analysis on navigation, simulated robotic manipulation, and game-playing environments, we show that imitating the intermediate computations of an expert's behavior enables procedure cloning to learn policies exhibiting significant generalization to unseen environment configurations, including those configurations for which running the expert's procedure directly is infeasible.

Paper Structure (49 sections, 5 equations, 12 figures, 1 table, 2 algorithms)

Introduction
Related Work
Generalization in sequential decision making.
Access to additional task information
Chain of thought sequence modeling.
Preliminaries
MDP notations.
Imitation learning.
Behavioral cloning (BC).
Procedure cloning
Chain of thought imitation
Procedures and procedure observations
Procedure cloning
Connection to BC with auxiliary tasks.
Proof of concept: Synthetic maze navigation
...and 34 more sections

Figures (12)

Figure 1: Visualization of the dataset collection, training, and inference of BC and PC on a maze navigation task. During dataset collection, the expert uses a search procedure to determine the optimal action to generate a path to the goal location (red star). During training, BC discards these intermediate search outputs and learns to map states to actions directly. In contrast, PC learns the complete sequence of intermediate computations (i.e., branches and backtracks) associated with the search procedure. During inference, PC generates a sequence of intermediate search outcomes emulating the search procedure on a new test map before outputting the final action.
Figure 2: Graphical models of vanilla BC, auxiliary BC, and procedure cloning with autoregressive and conditionally independent factorization. Node $s$ represents an input MDP state, $a$ represents an expert action, and $\textbf{x}$ represents the sequence of procedure observations $(x_0,...,x_L)$.
Figure 3: In a discrete maze, the expert employs BFS by first expanding a search perimeter until it encounters the goal cell, at which point it backtracks to find the optimal action at the starting state (cells in light blue are visited and dark blue are backtracked). We encode this algorithm as a sequence of procedure observations $(x_0,...,x_6)$ of the intermediate computation states, with each $x_i$ represented by a 2D array and each cell of the array containing BFS-relevant information (i.e., whether this cell is being expanded or backtracked and the action recorded when expanding to this cell). Procedure cloning is trained to predict the entire sequence of computations from input state to output action using a sequential model $p(a|x_L)\cdot\prod_{\ell=1}^L p(x_\ell|x_{\ell-1})\cdot p(x_0|s)$.
Figure 4: [Left] Visualization of the discrete maze (4 discrete actions) and AntMaze (8 continuous actions). [Right] Average success rate of PC and BC agents navigating to the goal from random start locations over 10 test mazes. Agents are trained on 5, 10, 20, or 40 mazes, with 1 expert trajectory on discrete maze and 5 expert trajectories on AntMaze. We find that procedure cloning leads to much better test maze generalization compared to alternative approaches.
Figure 5: [Left] Visualization of the bimanual sweep task. [Middle] Average success metric (proportion of particles in bowls at the end of the episode) of PC and BC agents completing the bimanual sweeping task after learning on 10, 100, 1000 expert trajectories; each variant is an aggregate of 10 runs. All of our algorithm implementations use the implicit loss function described in Florence et al. (2022) for this task. [Right] When using 1000 expert demonstrations with early stopping, PC achieves 83.9% compared to 78.2% success of the existing state-of-the-art achieved by implicit BC.
...and 7 more figures
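
The BFS procedure described in the Figure 3 caption can be illustrated with a minimal sketch that records procedure observations while searching. This is not the authors' implementation: the function name `bfs_procedure`, the cell encoding (0 for unvisited, 1 to 4 for the action used to enter a visited cell, `BACKTRACK` for backtracked cells), and the choice of one observation per expansion layer and per backtrack step are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical action encoding: (row delta, col delta) for up, down, left, right.
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]
BACKTRACK = 9  # hypothetical marker for cells on the backtracked path


def bfs_procedure(maze, start, goal):
    """Run BFS on a grid (0 = free, 1 = wall).

    Returns (observations, action): a list of 2D snapshots of the search
    state and the index of the first action on a shortest path, or None
    if the goal is unreachable.
    """
    rows, cols = len(maze), len(maze[0])
    # 0: unvisited; 1..4: visited, storing (action index + 1) used to enter.
    grid = [[0] * cols for _ in range(rows)]
    parent = {start: None}
    frontier = deque([start])
    observations = []
    found = False

    # Expansion phase: grow the search perimeter one layer at a time.
    while frontier and not found:
        next_frontier = deque()
        for r, c in frontier:
            for a, (dr, dc) in enumerate(ACTIONS):
                nr, nc = r + dr, c + dc
                if not (0 <= nr < rows and 0 <= nc < cols):
                    continue
                if maze[nr][nc] == 1 or (nr, nc) in parent:
                    continue
                parent[(nr, nc)] = ((r, c), a)
                grid[nr][nc] = a + 1
                next_frontier.append((nr, nc))
                if (nr, nc) == goal:
                    found = True
        frontier = next_frontier
        observations.append([row[:] for row in grid])

    if not found:
        return observations, None

    # Backtracking phase: walk parents from the goal back to the start.
    cell, action = goal, None
    while parent[cell] is not None:
        prev, action = parent[cell]
        grid[cell[0]][cell[1]] = BACKTRACK
        observations.append([row[:] for row in grid])
        cell = prev
    return observations, action


maze = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
observations, action = bfs_procedure(maze, start=(0, 0), goal=(2, 0))
```

In this sketch, behavioral cloning would train on the pairs `(maze, action)` alone, whereas procedure cloning would treat `observations` as the target sequence to predict before emitting `action`.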
