Dense Policy: Bidirectional Autoregressive Learning of Actions

Yue Su, Xinyu Zhan, Hongjie Fang, Han Xue, Hao-Shu Fang, Yong-Lu Li, Cewu Lu, Lixin Yang

TL;DR

Dense Policy introduces a bidirectional autoregressive framework for robotic action prediction that expands sparse keyframes into dense action sequences via a coarse-to-fine process with logarithmic-time inference. Built on an encoder-only architecture, it fuses observation features through cross-attention at each expansion level, achieving efficient training and faster inference while maintaining high accuracy. Across 11 simulation tasks in 3 benchmarks and 4 real-world tasks, Dense Policy outperforms both holistic generative baselines and unidirectional autoregressive approaches, with ablations confirming the value of bidirectional dependencies for long-horizon manipulation. The work demonstrates strong generalization across 2D/3D perception and real-world settings, and notes potential extensions to broader vision-language-action tasks and scaling to larger models.

Abstract

Mainstream visuomotor policies predominantly rely on generative models for holistic action prediction, while current autoregressive policies, predicting the next token or chunk, have shown suboptimal results. This motivates a search for more effective learning methods to unleash the potential of autoregressive policies for robotic manipulation. This paper introduces a bidirectionally expanded learning approach, termed Dense Policy, to establish a new paradigm for autoregressive policies in action prediction. It employs a lightweight encoder-only architecture to iteratively unfold the action sequence from an initial single frame into the target sequence in a coarse-to-fine manner with logarithmic-time inference. Extensive experiments validate that our dense policy has superior autoregressive learning capabilities and can surpass existing holistic generative policies. Our policy, example data, and training code will be publicly available upon publication. Project page: https://selen-suyue.github.io/DspNet/.
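
To make the "logarithmic-time" claim concrete, the following is a minimal Python sketch of coarse-to-fine expansion under stated assumptions, not the authors' implementation. The names `upsample`, `dense_expand`, and `identity_refine` are hypothetical. The midpoint interpolation used to double the sequence is an assumed stand-in for the paper's expansion step, and the `refine` callback stands in for the encoder that would fuse observation features via cross-attention. The point illustrated is only that doubling the sequence at each level reaches a horizon of T actions in log2(T) levels.

```python
from typing import Callable, List, Tuple

Action = List[float]
Refine = Callable[[List[Action], int], List[Action]]


def upsample(seq: List[Action]) -> List[Action]:
    """Double the sequence length by inserting a midpoint after each frame.

    Assumption: linear midpoints are used purely for illustration. The last
    frame has no right neighbour, so it is repeated.
    """
    out: List[Action] = []
    for i, a in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else a
        out.append(list(a))
        out.append([(x + y) / 2.0 for x, y in zip(a, nxt)])
    return out


def dense_expand(initial: Action, horizon: int, refine: Refine) -> Tuple[List[Action], int]:
    """Expand a single initial frame into `horizon` actions, coarse to fine.

    `refine(seq, level)` is a placeholder for the learned encoder pass.
    Returns the dense sequence and the number of expansion levels used.
    """
    if horizon < 1 or horizon & (horizon - 1):
        raise ValueError("horizon must be a power of two in this sketch")
    seq: List[Action] = [list(initial)]
    levels = 0
    while len(seq) < horizon:
        seq = refine(upsample(seq), levels)  # one model pass per level
        levels += 1
    return seq, levels


def identity_refine(seq: List[Action], level: int) -> List[Action]:
    """Hypothetical no-op refinement, used only to exercise the loop."""
    return seq


if __name__ == "__main__":
    actions, levels = dense_expand([0.0, 0.0], 16, identity_refine)
    print(len(actions), levels)
```

In this sketch a 16-step horizon needs 4 refinement passes, whereas a next-token autoregressive policy would need one pass per action. Each pass sees frames on both sides of every newly inserted position, which is the sense in which the expansion is bidirectional.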

Paper Structure

This paper contains 17 sections, 5 equations, 19 figures, 2 tables.

Figures (19)

  • Figure 1: Dense Policy: a robot policy model that generates raw robot actions in an autoregressive manner. During inference, Dense Policy performs bidirectional expansion on the current level of sparse action keyframes to obtain a denser action sequence.
  • Figure 3: Overview of Dense Policy. Dense Policy accepts visual inputs in different modalities and optional robot proprioception. It employs a unified encoder to perform cross-attention between hierarchical action representations and observation features. This facilitates a bidirectionally expanding dense process. During each dense process level, the actions, initially represented as sparse keyframes, are progressively infilled and refined into a complete predicted sequence, leading to a coarse-to-fine generation procedure.
  • Figure 5: Learning efficiency of different autoregressive paradigms across four tasks. The x-axis is the ID of the test time point; the y-axis records the mean of the top-5 success rates at that test time point.
  • Figure 7: Comparison of RISE and Dense Policy in the Flower Arrangement task.
  • Figure 8: Learning Efficiency of three policy models in real-world experiments.
  • ...and 14 more figures