Chain-of-Thought Predictive Control

Zhiwei Jia, Vineet Thumuluri, Fangchen Liu, Linghao Chen, Zhiao Huang, Hao Su

TL;DR

CoTPC tackles the challenge of learning generalizable policies for low-level robotic control from sub-optimal demonstrations. It combines unsupervised subskill discovery, which extracts chain-of-thought (CoT) sequences from the demos, with a Transformer that uses learnable CoT prompts to jointly predict subskills and actions under a hybrid masking scheme for dynamic guidance. Across Moving Maze, Franka Kitchen, and ManiSkill2, CoTPC consistently outperforms strong baselines, and ablations validate the benefits of coupled subskill-action predictions and CoT supervision. This work advances offline imitation learning by leveraging hierarchical planning signals without requiring optimal demos, enabling better transfer to varied tasks and environments.

Abstract

We study generalizable policy learning from demonstrations for complex low-level control (e.g., contact-rich object manipulations). We propose a novel hierarchical imitation learning method that utilizes sub-optimal demos. First, we propose an observation space-agnostic approach that efficiently discovers the multi-step subskill decomposition of the demos in an unsupervised manner. By grouping temporally close and functionally similar actions into subskill-level demo segments, the observations at the segment boundaries constitute a chain of planning steps for the task, which we refer to as the chain-of-thought (CoT). Next, we propose a Transformer-based design that effectively learns to predict the CoT as the subskill-level guidance. We couple action and subskill predictions via learnable prompt tokens and a hybrid masking strategy, which enable dynamically updated guidance at test time and improve feature representation of the trajectory for generalizable policy learning. Our method, Chain-of-Thought Predictive Control (CoTPC), consistently surpasses existing strong baselines on challenging manipulation tasks with sub-optimal demos.
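
The grouping step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: it assumes actions are compared by cosine similarity against the running mean of the current segment, that a hypothetical `threshold` decides when a new segment starts, and that the observation at the last timestep of each segment serves as the CoT entry. The names `segment_by_action_similarity` and `cot_indices` are illustrative only.

```python
import numpy as np


def segment_by_action_similarity(actions, threshold=0.5):
    """Greedily split a trajectory into contiguous segments of similar actions.

    Assumption: an action joins the current segment if its cosine similarity
    to the segment's mean action is at least `threshold` (a hypothetical value).
    Returns a list of (start, end) index pairs with `end` exclusive.
    """
    actions = np.asarray(actions, dtype=float)
    segments = []
    start = 0
    for t in range(1, len(actions)):
        mean = actions[start:t].mean(axis=0)
        denom = np.linalg.norm(mean) * np.linalg.norm(actions[t])
        # Treat a zero-norm vector as dissimilar to avoid division by zero.
        sim = float(mean @ actions[t] / denom) if denom > 0 else 0.0
        if sim < threshold:
            segments.append((start, t))
            start = t
    segments.append((start, len(actions)))
    return segments


def cot_indices(segments):
    """Timestep of the last action in each segment.

    Assumption: the observation at this timestep is used as the
    chain-of-thought entry for that subskill.
    """
    return [end - 1 for _, end in segments]


if __name__ == "__main__":
    # Toy 2-D actions: move right, then up, then left.
    toy_actions = [[1, 0]] * 3 + [[0, 1]] * 2 + [[-1, 0]] * 2
    segs = segment_by_action_similarity(toy_actions)
    print(segs, cot_indices(segs))
```

Because the sketch only looks at actions, it does not depend on how observations are represented, which matches the observation space-agnostic property claimed above.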

Paper Structure (75 sections, 7 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: During training, CoTPC learns to jointly predict (1) the next and the last subskill from each CoT token and (2) the low-level actions from each CoT token (for the action offset) and each state token (for the action center). See details in Sec. \ref{sec:coupled_action_and_cot}. During inference, where the CoT decoder is not used, the low-level actions are predicted under the guidance of the dynamically updated CoT features. The CoT tokens use all-to-all attention (they can see any token). The state and action tokens are causal (they can only see previous tokens and CoT tokens). Only 2 attention layers and 3 timesteps are shown for clarity.
  • Figure 2: Pairwise similarities of actions at different timesteps in two trajectories for Push Chair (left) and Peg Insertion (right). Action spaces are delta joint velocity and delta joint pose. Visually identifiable blocks along the diagonal are grouped, where actions are temporally close and functionally similar. This corresponds well with human intuition of subskills (see Appendix \ref{app:cos}).
  • Figure 3: Illustration of the Moving Maze (left), Franka Kitchen (middle), and some sampled tasks from ManiSkill2 (right), namely Turn Faucet, Peg Insertion, and Push Chair. See detailed descriptions in Sec. \ref{sec:maze}, \ref{sec:kitchen}, and \ref{sec:ms2}, respectively.
  • Figure 4: Sampled geometric variations for Push Chair, Turn Faucet, and Peg Insertion. The sizes of the peg and box and the relative locations of the hole vary across environment configurations.
  • Figure 5: Illustration of the network data flow in our third ablation study for the variants named vanilla, o-shared, and swapped, as well as the original design.
  • ...and 3 more figures
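
The attention pattern described in the Figure 1 caption (CoT tokens see every token; state and action tokens see only earlier tokens plus the CoT tokens) can be sketched as a boolean mask. This is a minimal illustration under assumptions that are not taken from the paper: the CoT prompt tokens are placed first in the sequence, followed by the trajectory tokens in time order, and causal attention is taken to include a token's own position. The function name `hybrid_attention_mask` is hypothetical.

```python
import numpy as np


def hybrid_attention_mask(num_cot, num_traj):
    """Build a boolean mask where mask[i, j] is True if token i may attend to token j.

    Assumed layout: indices [0, num_cot) are CoT prompt tokens,
    indices [num_cot, num_cot + num_traj) are state/action tokens in time order.
    """
    n = num_cot + num_traj
    mask = np.zeros((n, n), dtype=bool)
    # CoT tokens are all-to-all: they may attend to every token.
    mask[:num_cot, :] = True
    # Trajectory tokens may always attend to the CoT tokens.
    mask[num_cot:, :num_cot] = True
    # Trajectory tokens are causal among themselves (self included, by assumption).
    mask[num_cot:, num_cot:] = np.tril(np.ones((num_traj, num_traj), dtype=bool))
    return mask


if __name__ == "__main__":
    # Two CoT tokens and three trajectory tokens; 1 means "may attend".
    print(hybrid_attention_mask(2, 3).astype(int))
```

Because the CoT rows are unmasked, their features can change each time a new trajectory token is appended, which is one way to read the "dynamically updated" guidance mentioned in the caption.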