Real-Time Execution of Action Chunking Flow Policies
Kevin Black, Manuel Y. Galliker, Sergey Levine
TL;DR
This work tackles the latency of large action chunking policies in real-time visuomotor control by recasting the transition between chunks as an inpainting problem. Real-time chunking (RTC) generates the next action chunk while the current one executes, freezing the actions guaranteed to execute and inpainting the rest using flow-matching guidance with soft masking to maintain cross-chunk continuity. The approach requires no re-training and applies to diffusion- or flow-based VLAs. It is validated on a benchmark of 12 simulated tasks and on 6 real-world bimanual manipulation tasks, where it improves task throughput and robustness to inference delay, including on precise tasks such as lighting a match. The results indicate that RTC can significantly improve real-time performance in dynamic manipulation while remaining stable as latency varies. This matters in practice for embodied AI systems that need fast, reliable control from large learned policies.
Abstract
Modern AI systems, especially those interacting with the physical world, increasingly require real-time performance. However, the high latency of state-of-the-art generalist models, including recent vision-language action models (VLAs), poses a significant challenge. While action chunking has enabled temporal consistency in high-frequency control tasks, it does not fully address the latency problem, leading to pauses or out-of-distribution jerky movements at chunk boundaries. This paper presents a novel inference-time algorithm that enables smooth asynchronous execution of action chunking policies. Our method, real-time chunking (RTC), is applicable to any diffusion- or flow-based VLA out of the box with no re-training. It generates the next action chunk while executing the current one, "freezing" actions guaranteed to execute and "inpainting" the rest. To test RTC, we introduce a new benchmark of 12 highly dynamic tasks in the Kinetix simulator, as well as evaluate 6 challenging real-world bimanual manipulation tasks. Results demonstrate that RTC is fast, performant, and uniquely robust to inference delay, significantly improving task throughput and enabling high success rates in precise tasks, such as lighting a match, even in the presence of significant latency. See https://pi.website/research/real_time_chunking for videos.
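The freeze-and-inpaint idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the parameter names (horizon, delay, exec_steps), and the linear decay are all assumptions made for illustration. The sketch only builds per-step weights and aligns the previous chunk with the new one; the flow-matching guidance that would consume them is not shown.

```python
def soft_mask(horizon, delay, exec_steps):
    """Per-step weights tying a new chunk to the previous one.

    horizon:    actions per chunk (hypothetical name)
    delay:      control steps that elapse during inference; these actions
                execute before the new chunk is ready, so they are frozen
    exec_steps: actions executed from each chunk before switching; the last
                exec_steps positions have no counterpart in the old chunk
    """
    if delay < 0 or exec_steps < 0 or delay > horizon - exec_steps:
        raise ValueError("need 0 <= delay <= horizon - exec_steps")
    free_start = horizon - exec_steps
    weights = []
    for i in range(horizon):
        if i < delay:
            weights.append(1.0)  # frozen: guaranteed to execute
        elif i < free_start:
            # Assumed linear decay across the overlap region.
            weights.append((free_start - i) / (free_start - delay + 1))
        else:
            weights.append(0.0)  # no previous action to match
    return weights


def align_previous_chunk(prev_chunk, exec_steps, pad=0.0):
    """Shift the previous chunk so index 0 is the next action to execute."""
    remaining = list(prev_chunk[exec_steps:])
    return remaining + [pad] * exec_steps


if __name__ == "__main__":
    print(soft_mask(8, 2, 3))
    print(align_previous_chunk(list(range(8)), 3))
```

With a chunk of 8 actions, an inference delay of 2 steps, and 3 executed steps per chunk, the first two weights are 1 (frozen), the next three decay toward 0, and the final three are 0 because the previous chunk offers nothing to match there.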
