Running VLAs at Real-time Speed

Yunchao Ma, Yizhuang Zhou, Yunhuan Yang, Tiancai Wang, Haoqiang Fan

TL;DR

The paper demonstrates that a pi0-level vision-language-action (VLA) policy can run in real time on a single consumer GPU by aggressively eliminating CPU and kernel overheads, simplifying computation graphs, and adopting a Full Streaming Inference framework that overlaps vision-language and action-expert processing. The resulting pipeline achieves a $30$ Hz frame rate and up to $480$ Hz trajectory frequency, with a measured end-to-end latency of $27.3$ ms and a 100% real-world success rate on a falling-pen grasping task. The authors break down the optimization steps (CUDA graphs, graph fusion, tile tuning, Partial Split-k, and scalar fusion) and pair them with a roofline-based lower bound showing that the implementation is close to hardware limits while leaving room for further gains. A real-world validation over 600 episodes, together with the streaming architecture, demonstrates practical viability, and a public implementation is provided to support future work on higher frequencies and larger models. The work matters because it enables latency-sensitive tasks with large VLA models on commodity hardware and outlines a path toward integrating higher-frequency sensing, vision, and language-driven reasoning in robotic control.

Abstract

In this paper, we show how to run a pi0-level multi-view VLA at a 30 Hz frame rate and at most 480 Hz trajectory frequency using a single consumer GPU. This enables dynamic, real-time tasks that were previously believed to be unattainable by large VLA models. To achieve this, we introduce a bag of strategies to eliminate the overheads in model inference. A real-world experiment shows that the pi0 policy with our strategy achieves a 100% success rate on a falling-pen grasping task. Based on the results, we further propose a full streaming inference framework for real-time robot control with VLAs. Code is available at https://github.com/Dexmal/realtime-vla.
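The headline numbers imply a simple timing budget, which the following back-of-the-envelope sketch makes explicit. It uses only the figures quoted in the TL;DR and abstract (30 Hz, 480 Hz, 27.3 ms); the variable names are illustrative, and reading the 480 Hz figure as a fixed number of trajectory points per frame is an assumption rather than something stated here.

```python
# Figures quoted in the TL;DR and abstract
frame_rate_hz = 30          # camera / inference frame rate
trajectory_rate_hz = 480    # maximum trajectory frequency
latency_ms = 27.3           # measured end-to-end latency

# Time available per frame if every camera frame is to be processed
frame_budget_ms = 1000.0 / frame_rate_hz

# Slack left once one inference pass has finished
headroom_ms = frame_budget_ms - latency_ms

# Assumption: trajectory points are spread evenly over one frame period
points_per_frame = trajectory_rate_hz // frame_rate_hz

print(f"frame budget:     {frame_budget_ms:.1f} ms")   # 33.3 ms
print(f"headroom:         {headroom_ms:.1f} ms")       # 6.0 ms
print(f"points per frame: {points_per_frame}")         # 16
```

Under these numbers, one inference pass fits inside a single frame period with about 6 ms to spare, which is what allows every frame of a 30 FPS stream to be processed.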

Paper Structure

This paper contains 28 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Grasping a falling pen. The task has a very stringent time constraint. After the pen is observed falling, there is very little time before the action must be initiated. We implemented 30 FPS inference of the VLA model so that all frames in our camera stream can be processed, and the end-to-end reaction time can be shorter than 200 ms. This is on par with an average human in this test.
  • Figure 2: Breakdown of the model running time. Starting from a naive PyTorch implementation, we show how to reduce redundant computation and eliminate CPU overhead (Sec. \ref{sec:overhead}). Then we use techniques to optimize the individual kernels (Sec. \ref{sec:kernels}). Finally, we establish a lower bound (Sec. \ref{sec:lowerbound}) that is not far from the current implementation.
  • Figure 3: Transformations to simplify the computational graph. (1) Absorbing RMS affine parameters into the next linear layer; (2) Folding linear layers in the action-time embedding; (3) Fusing QKV into one weight matrix.
  • Figure 4: Computation flow of $\pi_0$ model. The model consists of a vision encoder (left), LLM (middle) and action expert (right). All components can be further decomposed into a series of matmuls and associated scalar operations.
  • Figure 5: The Full Streaming Inference framework. AE denotes the action expert.
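Two of the graph transformations named in the Figure 3 caption, absorbing the RMS affine parameters into the next linear layer and fusing QKV into one weight matrix, rest on simple algebraic identities. The pure-Python sketch below checks those identities on tiny random matrices. It is not the authors' implementation: the helper names, the dimension `d`, and the plain per-channel scale `gamma` are assumptions made for illustration, and transformation (2) is not covered.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rms_normalize(x, eps=1e-6):
    """RMS normalization without the learned affine scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def fold_scale_into_linear(W, gamma):
    """Absorb a per-channel scale into the following linear layer.

    W @ (gamma * x_hat) == (W * gamma) @ x_hat, so column j of W
    is multiplied by gamma[j].
    """
    return [[w * g for w, g in zip(row, gamma)] for row in W]

def fuse_qkv(Wq, Wk, Wv):
    """Stack the Q, K, V projections so one matmul yields all three."""
    return Wq + Wk + Wv

random.seed(0)
d = 4  # hypothetical hidden size, kept tiny for readability
x = [random.uniform(-1, 1) for _ in range(d)]
gamma = [random.uniform(0.5, 1.5) for _ in range(d)]
Wq, Wk, Wv = (
    [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
    for _ in range(3)
)

x_hat = rms_normalize(x)

# Reference path: explicit scale, then three separate projections
scaled = [g * v for g, v in zip(gamma, x_hat)]
reference = matvec(Wq, scaled) + matvec(Wk, scaled) + matvec(Wv, scaled)

# Simplified path: one fused matrix with the scale folded in
W_fused = fold_scale_into_linear(fuse_qkv(Wq, Wk, Wv), gamma)
simplified = matvec(W_fused, x_hat)

max_abs_diff = max(abs(a - b) for a, b in zip(reference, simplified))
print(f"fused shape: {len(W_fused)} x {len(W_fused[0])}")
print(f"paths agree: {max_abs_diff < 1e-9}")
```

Because both rewrites are applied to the weights once, ahead of time, the simplified path replaces an elementwise scale plus three matmuls with a single matmul at inference time, which is the kind of kernel-count reduction the caption describes.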