
Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation

Wenbo Zhang, Tianrun Hu, Hanbo Zhang, Yanyuan Qiao, Yuchu Qin, Yang Li, Jiajun Liu, Tao Kong, Lingqiao Liu, Xiao Ma

TL;DR

Chain-of-Action (CoA) reframes robotic manipulation as backward trajectory generation from a task-specific keyframe within a single autoregressive model. By modeling the trajectory distribution in reverse and anchoring it to a goal action, CoA enforces a global-to-local consistency that mitigates compounding errors and enhances spatial generalization. The approach is strengthened by four design components (continuous action tokens, dynamic stopping, reverse temporal ensemble, and multi-token prediction) along with latent consistency regularization. Empirically, CoA achieves state-of-the-art performance on 60 RLBench tasks and 8 real-world manipulation tasks, with strong evidence of improved spatial generalization and robust closed-loop execution. These results position trajectory autoregressive modeling as a competitive alternative for visuo-motor policy learning and real-world robotic control.

Abstract

We present Chain-of-Action (CoA), a novel visuo-motor policy paradigm built upon Trajectory Autoregressive Modeling. Unlike conventional approaches that predict the next-step action(s) forward in time, CoA generates an entire trajectory by explicit backward reasoning from task-specific goals through an action-level Chain-of-Thought (CoT) process. This process is unified within a single autoregressive structure: (1) the first token corresponds to a stable keyframe action that encodes the task-specific goals; and (2) subsequent action tokens are generated autoregressively, conditioned on the initial keyframe and previously predicted actions. This backward action reasoning enforces a global-to-local structure, allowing each local action to be tightly constrained by the final goal. To further realize the action reasoning structure, CoA incorporates four complementary designs: continuous action token representation; dynamic stopping for variable-length trajectory generation; reverse temporal ensemble; and multi-token prediction to balance action chunk modeling with global structure. As a result, CoA exhibits strong spatial generalization while preserving the flexibility and simplicity of a visuo-motor policy. Empirically, CoA achieves state-of-the-art performance across 60 RLBench tasks and 8 real-world manipulation tasks.
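The backward generation loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_decoder` is a hypothetical stand-in for the learned autoregressive decoder, and the stopping rule (halt once the generated action comes within `stop_radius` of the current state) is an assumption about how dynamic stopping might work. The names `generate_backward`, `stop_radius`, and `max_len` are likewise invented for this sketch.

```python
import math
from typing import Callable, List

Action = List[float]


def distance(a: Action, b: Action) -> float:
    """Euclidean distance between two action vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def toy_decoder(keyframe: Action, generated: List[Action],
                current_state: Action, step: float = 0.25) -> Action:
    """Hypothetical stand-in for a learned decoder.

    Moves a fixed step from the last generated action toward the
    current state; a real model would predict this from observations.
    """
    last = generated[-1]
    d = distance(last, current_state)
    if d <= step:
        return list(current_state)
    scale = step / d
    return [l + scale * (c - l) for l, c in zip(last, current_state)]


def generate_backward(keyframe: Action, current_state: Action,
                      decoder: Callable[[Action, List[Action], Action], Action],
                      stop_radius: float = 1e-3,
                      max_len: int = 100) -> List[Action]:
    """Generate a trajectory in reverse, starting from the keyframe action.

    The first token is the keyframe (goal) action. Each later token is
    conditioned on the keyframe and on all previously generated actions.
    Generation stops (assumed rule) once the latest action is within
    stop_radius of the current state, or when max_len is reached.
    """
    reverse_traj: List[Action] = [list(keyframe)]
    while len(reverse_traj) < max_len:
        if distance(reverse_traj[-1], current_state) <= stop_radius:
            break
        reverse_traj.append(decoder(keyframe, reverse_traj, current_state))
    # Flip into execution order: current state first, keyframe last.
    return reverse_traj[::-1]


if __name__ == "__main__":
    plan = generate_backward([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], toy_decoder)
    for waypoint in plan:
        print(waypoint)
```

Because every generated action is conditioned on the keyframe, the trajectory length varies with how far the current state is from the goal, which is the role the abstract assigns to dynamic stopping.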

Paper Structure

This paper contains 16 sections, 2 equations, 9 figures, 10 tables, 2 algorithms.

Figures (9)

  • Figure 1: Comparison between a conventional visuo-motor policy (left) and our proposed Chain-of-Action (right). The former is optimized to predict step-wise actions based on current observations rather than long-term goals, often leading to misaligned behaviors during execution. In contrast, Chain-of-Action adopts a backward generation paradigm, producing goal-conditioned trajectories that reliably execute toward the intended target.
  • Figure 2: Chain-of-Action built on trajectory autoregressive modeling. The left part illustrates the network architecture where notation is for the training stage, and the right part illustrates the execution process. The model encodes visual and proprioceptive observations and generates actions in reverse order from a predicted keyframe action by an autoregressive decoder. For clarity, the keyframe action $a_T$ is shown in green, and subsequent steps are visualized with a gradual color transition.
  • Figure 3: Visualization of predicted sub-trajectories across 10 widely used tasks. See Table \ref{tab:task_success} for details. Red waypoints represent ground-truth trajectories, and green waypoints denote model predictions. Each predicted trajectory is generated backward from a keyframe action to the current gripper state, enabling consistent goal-conditioned trajectory generation. The model successfully handles both straight and curved motion patterns.
  • Figure 4: Success rate improvement on RLBench-60, sorted by improvement from high to low. The average success rate over all tasks is shown in the inset on the right.
  • Figure 5: Correlation between success rate and spatial variance. Left: Overall success rate decreases as object spatial variance increases. Middle and right: CoA consistently outperforms ACT and DP across varying spatial generalization levels, with larger advantages in more challenging (higher variance) settings. Table: Pearson correlations highlight CoA’s robustness to spatial perturbations.
  • ...and 4 more figures