
Real-Time Execution of Action Chunking Flow Policies

Kevin Black, Manuel Y. Galliker, Sergey Levine

TL;DR

This work tackles the latency challenge in real-time visuomotor control with large action chunking policies by recasting inter-chunk transitions as inpainting tasks. RTC generates the next action chunk while the current one executes, freezing the guaranteed actions and inpainting the rest using flow-matching guidance with soft masking to maintain cross-chunk continuity. The approach is training-free and applicable to diffusion- or flow-based VLAs, and it is validated on a 12-task simulated benchmark plus 6 real-world dexterous tasks, showing improved throughput and robustness to inference delays, including precise tasks like lighting a match. The results indicate RTC can substantially enhance real-time performance in dynamic manipulation while maintaining stability across latency variations. This has practical impact for edge-enabled embodied AI systems that rely on fast, reliable control with complex learned policies.

Abstract

Modern AI systems, especially those interacting with the physical world, increasingly require real-time performance. However, the high latency of state-of-the-art generalist models, including recent vision-language action models (VLAs), poses a significant challenge. While action chunking has enabled temporal consistency in high-frequency control tasks, it does not fully address the latency problem, leading to pauses or out-of-distribution jerky movements at chunk boundaries. This paper presents a novel inference-time algorithm that enables smooth asynchronous execution of action chunking policies. Our method, real-time chunking (RTC), is applicable to any diffusion- or flow-based VLA out of the box with no re-training. It generates the next action chunk while executing the current one, "freezing" actions guaranteed to execute and "inpainting" the rest. To test RTC, we introduce a new benchmark of 12 highly dynamic tasks in the Kinetix simulator, as well as evaluate 6 challenging real-world bimanual manipulation tasks. Results demonstrate that RTC is fast, performant, and uniquely robust to inference delay, significantly improving task throughput and enabling high success rates in precise tasks, such as lighting a match, even in the presence of significant latency. See https://pi.website/research/real_time_chunking for videos.

Paper Structure

This paper contains 20 sections, 3 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Top: Real-time chunking (RTC) enables the robot to perform highly dexterous and dynamic tasks, such as lighting a match, even in the presence of inference delays in excess of 300 milliseconds, corresponding to more than 30% of the model's prediction horizon. Bottom: RTC performs the same robot motion 20% faster than synchronous inference [black2024pi, kim2024openvla, brohan2023rt, intelligence2025pi, kim2025fine, team2024octo], and more smoothly than all competing methods, including temporal ensembling [zhao2023learning]. The shown positions, velocities, and accelerations correspond to the shoulder joint of one arm, and are taken from the first 10 seconds of a real autonomous match-lighting rollout.
  • Figure 2: An illustration of a typical bifurcation between consecutive chunks. Inference is started between timesteps 3 and 4. The original chunk that was executing, $\{a_t\}$ (black), had planned to go above the obstacle while the newly generated chunk $\{a_t'\}$ (red) goes below the obstacle. However, $\{a_t'\}$ is not available until $d = 7$ steps later. A naive asynchronous algorithm might jump from $a_{10}$ to $a_{11}'$, inducing a very high, out-of-distribution acceleration. Temporal ensembling [zhao2023learning], i.e., interpolating between chunks, reduces the acceleration but produces poor actions.
  • Figure 3: A diagram illustrating how action generation attends to the previous action chunk in real-time chunking. If inference starts after the execution of $a_{-1}$ and the inference delay is $d = 4$, then the newly generated chunk will not be available until after $a_3$ is consumed. Therefore, $a_{0:3}$ are "frozen" and are attended to with a full guidance weight of 1. In the intermediate region, $a_{4:10}$, actions from the previous chunk are available but may be updated, since inference will have finished before $a_4$ is needed. This region is attended to with an exponentially decreasing guidance weight. Finally, the last $s = 5$ actions are beyond the end of the previous chunk, and need to be freshly generated. The execution horizon, $s$, is a hyperparameter constrained by $d \leq s \leq H - d$.
  • Figure 4: A comparison of naive inpainting (hard masking) and our proposed soft masking method: note that hard masking does not match the frozen region very well and produces faster changes in direction.
  • Figure 5: Top left: Kinetix environments; each involves getting a green object on the left to touch a blue one on the right. Bottom left: Execution horizon vs. solve rate with a fixed inference delay of 1. Only RTC and BID take full advantage of faster updates, showing strictly increasing performance with decreasing execution horizon. Right: Inference delay vs. solve rate with a fixed execution horizon of $s = \max(d, 1)$. RTC outperforms all baselines. Furthermore, soft masking (Sec. \ref{sec:soft_masking}) improves performance at lower inference delays and execution horizons. Each data point represents 2048 trials, and 95% Wilson score intervals are shaded in.
  • ...and 3 more figures
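
The three regions described in the Figure 3 caption (frozen, intermediate, freshly generated) can be sketched as a per-action guidance weight vector. This is a minimal illustration, not the authors' implementation: the function name `soft_mask_weights` and its parameter names are hypothetical, and the particular exponential decay used in the intermediate region is an assumption, since the caption only states that the weight decreases exponentially there. The sketch uses the caption's own example values ($d = 4$, $s = 5$, and a chunk of 16 actions, which follows from $a_{0:10}$ plus 5 new actions).

```python
import math


def soft_mask_weights(horizon: int, delay: int, exec_horizon: int) -> list[float]:
    """Per-action guidance weights for one new chunk (illustrative sketch).

    horizon      -- H, number of actions in a chunk
    delay        -- d, inference delay in control steps
    exec_horizon -- s, number of actions executed before re-planning
    """
    # Constraint stated in the Figure 3 caption: d <= s <= H - d.
    if not (delay <= exec_horizon <= horizon - delay):
        raise ValueError("require d <= s <= H - d")

    # Index of the first action that lies beyond the end of the previous chunk.
    overlap_end = horizon - exec_horizon

    weights = []
    for i in range(horizon):
        if i < delay:
            # Frozen region: these actions execute before inference finishes.
            weights.append(1.0)
        elif i < overlap_end:
            # Intermediate region: ASSUMED decay shape, chosen only so that
            # the weight falls smoothly from below 1 toward 0.
            c = (overlap_end - i) / (overlap_end - delay + 1)
            weights.append(c * math.expm1(c) / math.expm1(1.0))
        else:
            # No previous-chunk action exists here: generate freshly.
            weights.append(0.0)
    return weights


# Example values taken from the Figure 3 caption: d = 4, s = 5, H = 16.
weights = soft_mask_weights(horizon=16, delay=4, exec_horizon=5)
```

With these values, indices 0 to 3 receive weight 1, indices 4 to 10 receive strictly decreasing weights between 0 and 1, and the last 5 indices receive weight 0. Replacing the intermediate branch with a constant 0 would give the hard-masking baseline that Figure 4 compares against.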