Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance

Wenxuan Song, Jiayi Chen, Shuai Chen, Jingbo Wang, Pengxiang Ding, Han Zhao, Yikai Qin, Xinhu Zheng, Donglin Wang, Yan Wang, Haoang Li

Abstract

This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps, but they typically incur significant computational overhead due to the additional losses from the auxiliary tasks. To combine the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary task training within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To achieve this, we only need to train the model to convergence on a small-scale task set using two distinct training strategies. The difference between the resulting model parameters can then be interpreted as capability vectors contributed by the auxiliary tasks. These vectors are merged with the pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary-finetuned baselines at reduced computational overhead. Experimental results demonstrate that this approach is highly effective across diverse robot tasks. Project page: https://chris1220313648.github.io/Fast-dVLA/
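The capability-vector construction described above can be illustrated with a minimal sketch. This is not the authors' code: parameters are flat lists of floats standing in for model state dicts, and the names `capability_vector`, `merge`, and the scaling factor `alpha` are illustrative assumptions.

```python
# Hypothetical sketch of the abstract's parameter-space decoupling:
# the difference between auxiliary-trained and SFT weights (trained to
# convergence on the same small task set) is treated as a capability
# vector, then added to the pretrained weights to form the meta model.

def capability_vector(theta_aux, theta_sft):
    """Element-wise difference: what auxiliary training adds beyond SFT."""
    return [a - s for a, s in zip(theta_aux, theta_sft)]

def merge(theta_pre, vector, alpha=1.0):
    """Add the scaled capability vector to the pretrained parameters."""
    return [p + alpha * v for p, v in zip(theta_pre, vector)]

# Toy example with two parameters:
theta_pre = [0.0, 0.0]   # pretrained weights
theta_sft = [1.0, 1.0]   # standard SFT on the small task set
theta_aux = [2.0, 2.0]   # auxiliary-objective training on the same set
vec = capability_vector(theta_aux, theta_sft)   # [1.0, 1.0]
theta_meta = merge(theta_pre, vec)              # capability-enhanced meta model
```

The resulting `theta_meta` would then undergo standard SFT (plus the lightweight orthogonal regularization mentioned above) on the downstream task.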

Paper Structure

This paper contains 24 sections, 4 equations, 9 figures, 8 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Speed/success-rate trade-off. Left (intra-comparison): compared to other acceleration strategies for discrete diffusion VLAs (dVLAs), DD-VLA [liang2025discrete] and Dream-VLA [yedreamVLA], our Fast-dVLA achieves a favorable success rate and speed. Here, BlockDiff denotes block diffusion [arriola2025block]. Right (inter-comparison): our Fast-dVLA surpasses autoregressive methods, i.e., $\pi_0$-FAST [pertsch2025fast]. It also reaches performance and inference frequency on par with state-of-the-art (SOTA) continuous flow-matching methods, i.e., $\pi_{0.5}$ [intelligence2025pi_], while retaining several inherent advantages of dVLAs. We report metrics on LIBERO [liu2023libero].
  • Figure 2: Comparison among discrete decoding paradigms. Here, Forward per Sequence denotes the number of forward passes required to output a full sequence, Forward Speed denotes the decoding speed of each forward pass, and Speed per Sequence (i.e., Inference Speed) denotes the decoding speed for the full sequence output. Our Fast-dVLA requires significantly fewer forward passes and executes each pass efficiently, resulting in substantially faster inference.
  • Figure 3: Visualization of the decoding tendency of action tokens at different positions in Dream-VLA [yedreamVLA]. Brighter regions indicate higher decoding probability. Despite using bidirectional attention, the model exhibits a clear left-to-right decoding tendency such that action tokens at earlier temporal positions are typically decoded in earlier diffusion iterations. Overall, the decoding process reveals an implicit block-wise AR pattern.
  • Figure 4: KV cache similarity across diffusion iterations under block-diffusion decoding. We visualize the similarity of attention key-value (KV) states for the first action block across different denoising steps. (a): In native dVLA with bidirectional attention, the KV representations evolve across iterations, preventing effective reuse of cached states. (b): In contrast, after adapting dVLA to a block-wise attention architecture via asymmetric distillation, once all tokens in the first block are unmasked, the corresponding KV states remain fixed, enabling efficient KV cache reuse and substantially reducing the computational overhead in subsequent iterations.
  • Figure 5: Block-wise Attention.
  • ...and 4 more figures
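The block-wise attention pattern sketched in Figures 4-5 can be made concrete with a small illustrative mask builder. This is an assumption-laden sketch, not the paper's implementation: tokens attend bidirectionally within their own block and only to earlier blocks across blocks, which is the property that lets KV states of fully-unmasked earlier blocks stay fixed and be cached.

```python
# Illustrative block-wise attention mask: entry [i][j] is True when
# query token i may attend to key token j. Within a block, attention is
# bidirectional; across blocks, it is causal (no attending to later
# blocks), so a finished block's KV states never change thereafter.

def block_wise_mask(num_tokens, block_size):
    blk = [i // block_size for i in range(num_tokens)]  # block index per token
    return [[blk[j] <= blk[i] for j in range(num_tokens)]
            for i in range(num_tokens)]

mask = block_wise_mask(num_tokens=4, block_size=2)
# Tokens 0,1 form block 0 and see each other; tokens 2,3 form block 1
# and additionally see block 0, but block 0 never sees block 1.
```

Under a bidirectional (all-True) mask, by contrast, every token's KV state can shift whenever any token is re-denoised, which is why the native dVLA in Figure 4(a) cannot reuse its cache.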