FASTER: Rethinking Real-Time Flow VLAs

Yuxiang Lu, Zhe Liu, Xianzhe Fan, Zhenya Yang, Jinghua Hou, Junyi Li, Kaixin Ding, Hengshuang Zhao

Abstract

Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient: it forces the system to complete all sampling steps before any movement can start, which forms the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction tenfold (e.g., in $\pi_{0.5}$ and X-VLA) into a single step while preserving the quality of the long-horizon trajectory. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, show that FASTER unlocks unprecedented real-time responsiveness for generalist policies, enabling rapid generation of accurate and smooth trajectories.
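The claim that reaction time is uniformly distributed can be illustrated with a small simulation. This is a minimal sketch under a simplified model that is an assumption here, not the paper's derivation: an event occurs at a uniformly random moment within one inference-execution cycle, waits for the next observation to be captured, and is reacted to one TTFA later. The function names, the control period `control_dt_s`, and all numeric values are hypothetical.

```python
import random


def reaction_bounds(ttfa_s, horizon_steps, control_dt_s):
    """Best, worst, and mean reaction time under the assumed uniform model."""
    cycle_s = horizon_steps * control_dt_s  # time between consecutive observations
    best = ttfa_s                           # event lands just before an observation
    worst = ttfa_s + cycle_s                # event lands just after an observation
    return best, worst, 0.5 * (best + worst)


def sample_reaction_times(ttfa_s, horizon_steps, control_dt_s, n=100_000, seed=0):
    """Draw reaction times: uniform wait for the next observation, plus TTFA."""
    rng = random.Random(seed)
    cycle_s = horizon_steps * control_dt_s
    return [rng.uniform(0.0, cycle_s) + ttfa_s for _ in range(n)]


if __name__ == "__main__":
    # Hypothetical numbers: 100 ms TTFA, execution horizon of 10 steps at 30 Hz.
    best, worst, mean = reaction_bounds(0.1, 10, 1 / 30)
    samples = sample_reaction_times(0.1, 10, 1 / 30)
    print(f"best={best:.3f}s worst={worst:.3f}s mean={mean:.3f}s")
    print(f"empirical mean={sum(samples) / len(samples):.3f}s")
```

Under this model both levers named in the abstract are visible: lowering TTFA shifts the whole distribution, while shortening the execution horizon narrows it.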

Paper Structure (29 sections, 10 equations, 13 figures, 11 tables, 2 algorithms)

Figures (13)

  • Figure 1: We propose FASTER to alleviate the reaction latency bottleneck in action chunking flow policies. By compressing the sampling iterations of the immediate reaction into a single step, FASTER (bottom) achieves 10$\times$ acceleration compared to the original $\pi_{0.5}$ and X-VLA (top). This enables real-time responsiveness in highly dynamic tasks such as playing table tennis. FASTER is a plug-and-play solution for flow-based VLAs, demanding no architectural modifications or additional training.
  • Figure 2: Temporal pipelines of (a) synchronous and (b) asynchronous inference in a robotic system composed of an action chunking policy server and a robot client. As indicated by the best and worst cases, reaction time depends on both inference latency and the interval between consecutive inference-execution cycles. We also illustrate the decomposition of two adjacent action chunks to clarify the discretized inference delay $d$ and the execution horizon $s$ (bounded by $s_{\text{min}}$ and $H-d$) in the asynchronous client.
  • Figure 3: Visualizations of (a) straightness $S(\mathbf{A})$ of the denoising path during sampling of the action chunk, and (b) differences between the intermediate clean action estimates $\tilde{\mathbf{A}}_{t}^{\tau\rightarrow 0}$ at each sampling timestep $\tau$ and the final output $\mathbf{A}_{t}^{0}$.
  • Figure 4: Illustration of (a) constant timestep schedule used in conventional flow sampling and (b) Horizon-Aware Schedule (HAS) used in FASTER that allocates adaptive hit times across the action chunk and accelerates the sampling of early actions, enabling streaming output.
  • Figure 5: Comparison of real-world reaction speed on the table tennis task. Left: Visualization of rollouts on an RTX 4090. The third column corresponds to the contact moment, and consecutive images in a row are 166.7 ms apart. Right: Quantitative completion scores on two GPUs.
  • ...and 8 more figures
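The contrast drawn in the Figure 4 caption, between a constant timestep schedule and one with per-action hit times, can be sketched as follows. This is an illustration only: the linear allocation of hit steps, the function names, and the chunk and step sizes are assumptions of this sketch, not the formula used by FASTER. It follows the convention of the Figure 3 caption, where $\tau = 1$ is noise and $\tau = 0$ is the clean action.

```python
def constant_schedule(horizon, num_steps):
    """All actions share one flow time, moving from 1 (noise) to 0 (clean)."""
    return [[1.0 - k / num_steps] * horizon for k in range(num_steps + 1)]


def hit_steps_linear(horizon, num_steps):
    """Hypothetical allocation: the first action is clean after 1 step, the last after num_steps."""
    if horizon == 1:
        return [1]
    return [round(1 + i * (num_steps - 1) / (horizon - 1)) for i in range(horizon)]


def horizon_aware_schedule(horizon, num_steps):
    """Per-action flow times: action i descends linearly and reaches 0 at its own hit step."""
    hits = hit_steps_linear(horizon, num_steps)
    taus = [[max(0.0, 1.0 - k / h) for h in hits] for k in range(num_steps + 1)]
    return taus, hits


if __name__ == "__main__":
    # Hypothetical sizes: a chunk of 8 actions sampled with 10 steps.
    taus, hits = horizon_aware_schedule(horizon=8, num_steps=10)
    print("hit steps per action:", hits)
    print("flow times after one step:", [round(t, 2) for t in taus[1]])
```

In this sketch the first action is already clean after one sampling step and could be streamed to the robot immediately, whereas under `constant_schedule` no action is clean until the final step.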