AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

Yuhua Jiang, Shuang Cheng, Yan Ding, Feifei Gao, Biqing Qi

TL;DR

AsyncVLA addresses the instability of synchronous flow matching (SFM) in long-horizon VLA tasks by introducing asynchronous flow matching (AFM) and a confidence rater for self-correction. It unifies SFM and AFM within a single model and reuses the vision-language KV-cache to stay efficient while selectively regenerating low-confidence action tokens. The approach demonstrates self-correction and data-efficient learning on the LIBERO, WidowX, and Google Robot benchmarks, where it achieves state-of-the-art results. The work combines context-aware asynchronous generation with confidence-driven refinement in a single framework.

Abstract

Vision-language-action (VLA) models have recently emerged as a powerful paradigm for building generalist robots. However, traditional VLA models that generate actions through flow matching (FM) typically rely on rigid and uniform time schedules, i.e., synchronous FM (SFM). Without action context awareness and asynchronous self-correction, SFM becomes unstable in long-horizon tasks, where a single action error can cascade into failure. In this work, we propose asynchronous flow matching VLA (AsyncVLA), a novel framework that introduces temporal flexibility through asynchronous FM (AFM) and enables self-correction in action generation. AsyncVLA departs from the vanilla SFM used in VLA models by generating action tokens on a non-uniform time schedule with action context awareness. In addition, our method introduces a confidence rater that estimates the confidence of the initially generated actions, enabling the model to selectively refine inaccurate action tokens before execution. Moreover, we propose a unified training procedure for SFM and AFM that endows a single model with both modes, improving KV-cache utilization. Extensive experiments on robotic manipulation benchmarks demonstrate that AsyncVLA is data-efficient and exhibits self-correction ability. AsyncVLA achieves state-of-the-art results across general embodied evaluations due to its asynchronous generation in AFM. Our code is available at https://github.com/YuhuaJiang2002/AsyncVLA.

Paper Structure

This paper contains 22 sections, 6 equations, 5 figures, 6 tables, 2 algorithms.

Figures (5)

  • Figure 1: Comparison of vanilla flow matching and asynchronous flow matching in VLA models. Top: Vanilla flow matching employs a uniform time schedule for all action tokens, generating them synchronously from noise to actions, i.e., synchronous flow matching. Bottom: Asynchronous flow matching dynamically assigns individual time steps to regenerate action tokens. The first-round generated actions provide context information that allows for selective and non-uniform self-correction in the second-round action generation.
  • Figure 2: Overview of the AsyncVLA framework, which comprises three components: (a) SFM applies a uniform time schedule $t$ across all action tokens, generating them synchronously from noise ($t=1$) to action ($t=0$). (b) The confidence rater estimates the token-level confidence of the actions and masks the low-confidence actions by selecting asynchronous noise for AFM. (c) AFM dynamically assigns an individual FM time to each action token, allowing for selective and non-uniform regeneration based on the actions' confidence. SFM and AFM share a single unified model with the same parameters, enabling the VL KV-cache produced by SFM to be reused in AFM.
  • Figure 3: Illustration of self-correction ability in AsyncVLA on the LIBERO-Long task suite. The top row shows the first-round actions generated by SFM, and the bottom row shows the second-round actions regenerated by the following AFM.
  • Figure 4: Training loss curve comparison when only part of the LIBERO-Spatial dataset is used for training.
  • Figure 5: Success rate comparison during training. Evaluation is conducted on the LIBERO-Spatial test suite.
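
The two-round procedure described in the captions of Figures 1 and 2 can be illustrated with a small toy sketch. It is not the authors' implementation: the velocity function below is a hypothetical oracle that reads the target actions directly (a real model would predict the velocity from vision-language features and the action context), each action token is reduced to a single scalar, the confidence scores are hand-written stand-ins for the confidence rater's output, and the names `oracle_velocity`, `euler_flow`, `sfm_round`, `afm_round`, and the `threshold` parameter are all assumptions made for illustration. The sketch follows the time convention stated for Figure 2, with noise at t = 1 and the action at t = 0.

```python
import random


def oracle_velocity(x, t, target):
    """Hypothetical stand-in for the learned velocity network.

    Assumes the linear path x_t = (1 - t) * a + t * eps, whose velocity is
    eps - a = (x_t - a) / t. Tokens already at t = 0 get zero velocity.
    """
    return [(xi - ai) / ti if ti > 0 else 0.0
            for xi, ai, ti in zip(x, target, t)]


def euler_flow(x, t_start, target, steps=10):
    """Euler-integrate every token from its own start time down to t = 0."""
    x, t = list(x), list(t_start)
    dt = [ti / steps for ti in t_start]  # per-token step size
    for _ in range(steps):
        v = oracle_velocity(x, t, target)
        x = [xi - di * vi for xi, di, vi in zip(x, dt, v)]
        t = [ti - di for ti, di in zip(t, dt)]
    return x


def sfm_round(target, rng, steps=10):
    """Synchronous round: all tokens start from noise at the same time t = 1."""
    noise = [rng.gauss(0.0, 1.0) for _ in target]
    return euler_flow(noise, [1.0] * len(target), target, steps)


def afm_round(actions, confidence, target, rng, threshold=0.5, steps=10):
    """Asynchronous round: low-confidence tokens are re-noised and restart at
    t = 1, while confident tokens stay at t = 0 and are left untouched."""
    t_start = [1.0 if c < threshold else 0.0 for c in confidence]
    x = [rng.gauss(0.0, 1.0) if ts > 0 else a
         for a, ts in zip(actions, t_start)]
    return euler_flow(x, t_start, target, steps), t_start


rng = random.Random(0)
target = [0.2, -0.4, 0.9, 0.1]      # toy ground-truth actions

first = sfm_round(target, rng)
first[2] += 0.5                     # simulate one inaccurate first-round token

confidence = [0.9, 0.8, 0.1, 0.95]  # hypothetical rater scores per token
second, t_start = afm_round(first, confidence, target, rng)

print("per-token start times:", t_start)
print("first round :", [round(a, 3) for a in first])
print("second round:", [round(a, 3) for a in second])
```

In this toy setting only the third token is regenerated, and the other tokens pass through the second round unchanged. Because the oracle ignores the other tokens, the sketch does not capture how the retained actions serve as context for the regenerated ones, nor the reuse of the VL KV-cache between rounds.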