DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching

Jiayi Chen, Wenxuan Song, Shuai Chen, Jingbo Wang, Zhijun Li, Haoang Li

Abstract

Vision-Language-Action (VLA) models that encode actions using a discrete tokenization scheme are increasingly adopted for robotic manipulation, but existing decoding paradigms remain fundamentally limited. Whether actions are decoded sequentially by autoregressive VLAs or in parallel by discrete diffusion VLAs, once a token is generated, it is typically fixed and cannot be revised in subsequent iterations, so early token errors cannot be effectively corrected later. We propose DFM-VLA, a discrete flow matching VLA for iterative refinement of action tokens. DFM-VLA models a token-level probability velocity field that dynamically updates the full action sequence across refinement iterations. We investigate two ways to construct the velocity field: an auxiliary velocity-head formulation and an action-embedding-guided formulation. Our framework further adopts a two-stage decoding strategy with an iterative refinement stage followed by deterministic validation for stable convergence. Extensive experiments on CALVIN, LIBERO, and real-world manipulation tasks show that DFM-VLA consistently outperforms strong autoregressive, discrete diffusion, and continuous diffusion baselines in manipulation performance while retaining high inference efficiency. In particular, DFM-VLA achieves an average success length of 4.44 on CALVIN and an average success rate of 95.7% on LIBERO, highlighting the value of action refinement via discrete flow matching for robotic manipulation. Our project is available at https://chris1220313648.github.io/DFM-VLA/

Paper Structure

This paper contains 27 sections, 9 equations, 7 figures, 8 tables, 2 algorithms.

Figures (7)

  • Figure 1: Comparison of decoding paradigms. 1) Autoregressive decoding requires as many steps as the action sequence length, while 2) discrete diffusion enables faster generation through parallel token updates. However, both share the same limitation: once an erroneous token is produced, it cannot be corrected in later iterations. We refer to this phenomenon as irreversible commitment. In contrast, 3) our DFM-VLA performs full-sequence action refinement at every iteration, allowing token-level correction and improving action quality for robotic manipulation.
  • Figure 2: Comparison of three discrete VLA paradigms. AR VLA uses causal attention to generate future action tokens from previously generated ones, DD-VLA decodes only masked tokens, while DFM-VLA iteratively refines tokens over the full sequence.
  • Figure 3: Overall architecture of DFM-VLA. Given language-vision context and noised action tokens $x_t$, the model predicts clean actions $x_1$ and learns the velocity field via $\mathcal{L}_{\text{ce}}$ or $\mathcal{L}_{\text{head}}$.
  • Figure 4: Visualization of a single decoding step in the iterative refinement stage. After predicting the final state $x^\text{pred}_1$, the model does not directly output the final action tokens. Instead, it constructs a velocity field to compute transition rates and selectively updates tokens to the next state $x_{t+h}$ at each step.
  • Figure 5: Comparison of velocity field constructions across training steps. The Embedding-guided variant consistently outperforms the Velocity Head variant in both convergence speed and final performance, reaching a state-of-the-art success rate of 95.7% on LIBERO with only 20k training steps.
  • ...and 2 more figures
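The decoding step sketched in Figure 4 follows the standard discrete flow matching recipe: the model predicts a distribution over clean tokens $x_1$, a velocity field toward that prediction is formed along a mixture probability path, and each token either jumps to a new value or stays, according to the resulting transition rates. The snippet below is a minimal illustrative sketch of one such Euler update, not the authors' implementation; the function name, clamping of jump probabilities, and per-token sampling loop are all assumptions for illustration.

```python
import numpy as np

def dfm_euler_step(x_t, p1_given_xt, t, h, rng):
    """One Euler update of a discrete flow matching sampler (hypothetical sketch).

    x_t:         (L,) current action-token ids
    p1_given_xt: (L, V) model's predicted distribution over clean tokens x1
    t:           current time in [0, 1)
    h:           step size, so the next state corresponds to time t + h
    """
    L, V = p1_given_xt.shape
    x_next = x_t.copy()
    for i in range(L):
        # Velocity toward the predicted clean distribution along a mixture path:
        # u(y | x) = p1(y) / (1 - t) for y != x_t[i]; diagonal entry set to 0.
        u = p1_given_xt[i] / (1.0 - t)
        u[x_t[i]] = 0.0
        # Transition rates -> per-target jump probabilities for this step.
        jump = np.minimum(h * u, 1.0)
        stay = max(0.0, 1.0 - jump.sum())   # probability of keeping the token
        probs = jump.copy()
        probs[x_t[i]] = stay
        probs = probs / probs.sum()
        # Every position is re-sampled, so earlier tokens can still be revised.
        x_next[i] = rng.choice(V, p=probs)
    return x_next
```

Because every position is updated from the full-sequence prediction at every step, an erroneous token committed earlier retains a nonzero probability of being replaced later, which is the token-level correction the paper contrasts with autoregressive and masked-diffusion decoding.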