Running VLAs at Real-time Speed

Yunchao Ma; Yizhuang Zhou; Yunhuan Yang; Tiancai Wang; Haoqiang Fan

TL;DR

The paper demonstrates that a pi0-level vision-language-action (VLA) policy can run in real time on a single consumer GPU by aggressively eliminating CPU and kernel overheads, simplifying computation graphs, and adopting a Full Streaming Inference framework that overlaps vision-language and action-expert processing. The resulting pipeline achieves a $30$ Hz frame rate and up to $480$ Hz trajectory frequency, with a measured end-to-end latency of $27.3$ ms and a 100% real-world success rate on a falling-pen grasp task. The authors give a detailed breakdown of the optimization steps (CUDA graphs, graph fusion, tile tuning, Partial Split-k, and scalar fusion) and pair it with a roofline-based lower bound showing that the pipeline comes close to hardware limits while leaving room for further gains. A real-world validation over 600 episodes, together with the streaming architecture, demonstrates practical viability, and a public implementation is provided to support future work toward higher frequencies and larger models. The work matters because it enables real-time, latency-sensitive tasks with large VLA models on commodity hardware, and it lays out a path toward integrating higher-frequency sensing, vision, and language-driven reasoning in robotic control.

Abstract

In this paper, we show how to run a pi0-level multi-view VLA at a 30 Hz frame rate and a trajectory frequency of up to 480 Hz using a single consumer GPU. This enables dynamic, real-time tasks that were previously believed to be unattainable for large VLA models. To achieve this, we introduce a bag of strategies to eliminate the overheads in model inference. A real-world experiment shows that the pi0 policy with our strategy achieves a 100% success rate on a falling-pen grasping task. Based on the results, we further propose a full streaming inference framework for real-time robot control with VLAs. Code is available at https://github.com/Dexmal/realtime-vla.
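To put the quoted figures in perspective, the short sketch below works through the timing arithmetic they imply. It is only an illustration: the function names are hypothetical, and it assumes that "trajectory frequency" means action points emitted per second, which this page does not spell out. The constants are the 30 Hz frame rate, the 480 Hz peak trajectory frequency, and the 27.3 ms measured latency reported here.

```python
# Back-of-the-envelope timing check using the figures quoted on this page.
# Function names are hypothetical; this is not code from the paper's repository.

def frame_budget_ms(frame_rate_hz: float) -> float:
    """Time available per camera frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def headroom_ms(frame_rate_hz: float, latency_ms: float) -> float:
    """Slack left in each frame period after one inference pass."""
    return frame_budget_ms(frame_rate_hz) - latency_ms


def actions_per_frame(trajectory_hz: float, frame_rate_hz: float) -> float:
    """Action points per frame, assuming trajectory frequency counts
    action points per second (an assumption, see the lead-in)."""
    return trajectory_hz / frame_rate_hz


FRAME_RATE_HZ = 30.0   # reported frame rate
TRAJECTORY_HZ = 480.0  # reported maximum trajectory frequency
LATENCY_MS = 27.3      # reported end-to-end latency

budget = frame_budget_ms(FRAME_RATE_HZ)
slack = headroom_ms(FRAME_RATE_HZ, LATENCY_MS)
ratio = actions_per_frame(TRAJECTORY_HZ, FRAME_RATE_HZ)

print(f"frame budget: {budget:.1f} ms")
print(f"headroom after inference: {slack:.1f} ms")
print(f"action points per frame: {ratio:.0f}")
```

Under these assumptions, one inference pass fits inside a single 30 Hz frame period with a few milliseconds to spare, which is the sense in which the pipeline keeps up with the camera.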

Paper Structure

Table of Contents

Figures (5)