HiFlow: Tokenization-Free Scale-Wise Autoregressive Policy Learning via Flow Matching

Daichi Yashima, Koki Seno, Shuhei Kurita, Yusuke Oda, Komei Sugiura

Abstract

Coarse-to-fine autoregressive modeling has recently shown strong promise for visuomotor policy learning, combining the inference efficiency of autoregressive methods with the global trajectory coherence of diffusion-based policies. However, existing approaches rely on discrete action tokenizers that map continuous action sequences to codebook indices, a design inherited from image generation where learned compression is necessary for high-dimensional pixel data. We observe that robot actions are inherently low-dimensional continuous vectors, for which such tokenization introduces unnecessary quantization error and a multi-stage training pipeline. In this work, we propose Hierarchical Flow Policy (HiFlow), a tokenization-free coarse-to-fine autoregressive policy that operates directly on raw continuous actions. HiFlow constructs multi-scale continuous action targets from each action chunk via simple temporal pooling. Specifically, it averages contiguous action windows to produce coarse summaries that are refined at finer temporal resolutions. The entire model is trained end-to-end in a single stage, eliminating the need for a separate tokenizer. Experiments on MimicGen, RoboTwin 2.0, and real-world environments demonstrate that HiFlow consistently outperforms existing methods including diffusion-based and tokenization-based autoregressive policies.
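The multi-scale target construction described in the abstract (averaging contiguous action windows to obtain coarse summaries) can be sketched as a simple temporal average-pooling routine. This is an illustrative reconstruction, not the paper's code; the chunk length and the scale schedule `(1, 4, 16)` are assumptions for the example.

```python
import numpy as np

def multiscale_targets(actions: np.ndarray, scales=(1, 4, 16)):
    """Build coarse-to-fine action targets by temporal average pooling.

    actions: (T, D) chunk of raw continuous actions.
    scales:  number of timesteps kept at each scale (coarse -> fine);
             each entry must divide the chunk length T.
    Returns a list of (s, D) arrays, one per scale.
    """
    T, D = actions.shape
    targets = []
    for s in scales:
        if T % s != 0:
            raise ValueError(f"scale {s} must divide chunk length {T}")
        win = T // s
        # average contiguous windows of length `win` -> (s, D) summary
        pooled = actions.reshape(s, win, D).mean(axis=1)
        targets.append(pooled)
    return targets
```

The finest scale equals the raw chunk, so the model ultimately regresses the full-resolution actions; coarser scales are lossy temporal summaries rather than learned codes, which is what removes the tokenizer from the pipeline.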

Paper Structure

This paper contains 20 sections, 9 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 2: Tokenized vs tokenization-free scale-wise policy learning. Prior next-scale autoregressive policies discretize multi-scale action chunks using a VQ-VAE tokenizer and learn to predict code indices with cross-entropy, which introduces a separate tokenizer pretraining stage and incurs quantization error. HiFlow instead refines raw continuous actions from coarse to fine temporal scales. A scale-wise autoregressive transformer provides coarse-to-fine conditioning, and a shared conditional flow matching module generates continuous actions at each scale, eliminating the tokenizer while preserving fine-grained control.
  • Figure 3: Architecture overview of HiFlow. Given visual observations, proprioceptive states, and a task identifier, the scale-wise autoregressive Transformer (ScaleAR) produces conditioning features at progressively finer temporal scales via a scale-wise causal mask. A shared ActionFlowNet then generates continuous actions at each scale through conditional flow matching, progressively refining the trajectory from a single-token global summary (scale 1) to the full $T$-step action chunk. The entire pipeline operates in continuous action space without any discrete tokenization.
  • Figure 4: Task overview across the three evaluation benchmarks. Top: 8 single-arm manipulation tasks from MimicGen. Middle: 3 dual-arm tasks from RoboTwin 2.0. Bottom: 5 real-world tasks with a mobile manipulator, spanning object grasping, relocation, and target placement.
  • Figure 5: Qualitative results of HiFlow on representative tasks from each simulation benchmark. (a) Threading from MimicGen. (b) Place can basket from RoboTwin 2.0.
  • Figure 6: Qualitative results of HiFlow on the Orange$\rightarrow$Plate task from the real-world experiments. Note that the third-person view is not included in the model's observations.
  • ...and 1 more figure
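The per-scale generation step in the Figure 3 caption (a shared ActionFlowNet trained with conditional flow matching) follows the standard flow matching recipe: sample a noise point, interpolate linearly toward the ground-truth actions, and regress the constant velocity along that path. The sketch below uses the common rectified-flow parameterization as an assumption; HiFlow's exact loss, network, and conditioning interface are not specified on this page, and `velocity_net` is a hypothetical callable standing in for the model.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_loss(velocity_net, x1, cond):
    """Conditional flow matching loss (rectified-flow form) for one scale.

    x1:   (N, D) ground-truth continuous actions at this scale.
    cond: (N, C) conditioning features, e.g. the scale-wise AR
          transformer outputs described in the Figure 3 caption.
    velocity_net(x_t, t, cond) predicts the flow velocity at time t.
    """
    N, D = x1.shape
    x0 = rng.standard_normal((N, D))   # noise endpoint of the path
    t = rng.uniform(size=(N, 1))       # interpolation time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1      # linear probability path
    v_target = x1 - x0                 # constant velocity along the path
    v_pred = velocity_net(x_t, t, cond)
    return np.mean((v_pred - v_target) ** 2)
```

At inference, integrating the learned velocity field from noise at each scale, conditioned on the coarser scales already generated, yields the coarse-to-fine refinement without any discrete codebook lookup.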