Training-Time Action Conditioning for Efficient Real-Time Chunking

Kevin Black, Allen Z. Ren, Michael Equi, Sergey Levine

TL;DR

The paper tackles latency in real-time control with vision-language-action models (VLAs) by replacing inference-time inpainting with training-time action conditioning that simulates inference delay. By conditioning on a ground-truth action prefix during training and using per-token flow timesteps, the method serves as a drop-in replacement for inference-time real-time chunking (RTC) with no inference-time overhead. In simulation, training-time RTC outperforms inference-time RTC at higher inference delays; in real-world tasks with the $\pi_{0.6}$ VLA, it matches inference-time RTC in task performance and speed while being computationally cheaper. The approach offers a practical, lightweight path to more reactive robot control without architectural changes.

Abstract

Real-time chunking (RTC) enables vision-language-action models (VLAs) to generate smooth, reactive robot trajectories by asynchronously predicting action chunks and conditioning on previously committed actions via inference-time inpainting. However, this inpainting method introduces computational overhead that increases inference latency. In this work, we propose a simple alternative: simulating inference delay at training time and conditioning on action prefixes directly, eliminating any inference-time overhead. Our method requires no modifications to the model architecture or robot runtime, and can be implemented with only a few additional lines of code. In simulated experiments, we find that training-time RTC outperforms inference-time RTC at higher inference delays. In real-world experiments on box building and espresso making tasks with the $\pi_{0.6}$ VLA, we demonstrate that training-time RTC maintains both task performance and speed parity with inference-time RTC while being computationally cheaper. Our results suggest that training-time action conditioning is a practical drop-in replacement for inference-time inpainting in real-time robot control.
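
To make the idea of simulating inference delay at training time more concrete, the following is a minimal sketch of how one training example might be built. It is not the authors' implementation. The function name `make_prefix_conditioned_example`, the uniform sampling of the delay, the interpolation convention (timestep 1 means clean data, timestep 0 means pure noise), and the loss mask over postfix tokens are all assumptions made for illustration. Plain Python lists stand in for tensors.

```python
import random


def make_prefix_conditioned_example(actions, max_delay, rng):
    """Build one flow-matching training example with a clean action prefix.

    actions:   list of H action vectors (lists of floats) from a demonstration.
    max_delay: largest simulated inference delay in control steps (assumed < H).
    rng:       a random.Random instance.
    """
    horizon = len(actions)
    if not 0 <= max_delay < horizon:
        raise ValueError("max_delay must satisfy 0 <= max_delay < H")

    delay = rng.randint(0, max_delay)  # simulated delay d (uniform: an assumption)
    tau = rng.random()                 # one flow timestep shared by postfix tokens

    noisy_actions, timesteps, targets, loss_mask = [], [], [], []
    for i, action in enumerate(actions):
        if i < delay:
            # Prefix token: ground-truth action, marked as fully denoised,
            # and excluded from the loss.
            noisy_actions.append(list(action))
            timesteps.append(1.0)
            targets.append([0.0] * len(action))
            loss_mask.append(0.0)
        else:
            # Postfix token: interpolate between noise and data at timestep tau.
            noise = [rng.gauss(0.0, 1.0) for _ in action]
            noisy_actions.append(
                [tau * a + (1.0 - tau) * e for a, e in zip(action, noise)]
            )
            timesteps.append(tau)
            # Velocity target under the assumed interpolation convention.
            targets.append([a - e for a, e in zip(action, noise)])
            loss_mask.append(1.0)

    return {
        "delay": delay,
        "noisy_actions": noisy_actions,
        "timesteps": timesteps,   # per-token timesteps, not a single scalar
        "targets": targets,
        "loss_mask": loss_mask,
    }


if __name__ == "__main__":
    demo_chunk = [[float(i), -float(i)] for i in range(8)]  # hypothetical H=8 chunk
    example = make_prefix_conditioned_example(demo_chunk, 4, random.Random(0))
    print(example["delay"], example["timesteps"])
```

In this sketch the model would learn the delay only from the per-token timesteps: prefix tokens carry timestep 1 and clean actions, so no extra input or architectural change is needed. At inference, the same inputs would be filled with the previously committed actions instead of demonstration data.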

Paper Structure

This paper contains 9 sections, 1 equation, 5 figures.

Figures (5)

  • Figure 1: A diagram illustrating two overlapping action chunks. The $d$ actions between $t$ and $t + d$, taken from the previous chunk, are the action prefix (red). From the diagram, we can easily see that we must satisfy the constraint $t + d \leq t - s + H \implies d \leq H - s$ to have a valid action prefix. Note that inference-time RTC uses all $H - s$ overlapping actions (red and yellow) to guide the generation of the current chunk, whereas training-time RTC only uses the first $d$ actions (red).
  • Figure 2: An illustration of our conditioning architecture, as applied to a standard diffusion transformer such as the $\pi_{0.6}$ action expert. We always feed in ground-truth, non-noisy prefix actions, while learning to denoise the postfix actions. The flow matching timestep differs between tokens, which indicates the inference delay to the model.
  • Figure 3: Simulated results: inference delay vs. solve rate with a fixed execution horizon of $s = \max(d, 1)$. Training-time RTC performs better than inference-time RTC at inference delays of 2 or higher. Each data point represents 2048 trials, and 95% Wilson score intervals are shaded in.
  • Figure 4: Real-world evaluation tasks: building a cardboard box and making espresso (including grinding, tamping, extracting, and pouring).
  • Figure 5: Real-world results: success rate and duration for espresso making and box building. Training-time and inference-time RTC perform similarly, and both improve speed over synchronous inference. Error bars represent 68% Wilson score intervals for success rate and $\pm1$ SEM for duration.
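
The captions for Figures 3 and 5 report 95% and 68% Wilson score intervals on success rates. As a reference for readers reproducing such error bars, here is a small sketch of the standard Wilson score interval for a binomial proportion. The function name `wilson_interval` and the success counts in the usage lines are hypothetical; only the trial count of 2048 per data point comes from the Figure 3 caption.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, trials, confidence=0.95):
    """Return the (lower, upper) Wilson score interval for a success rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")

    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (
        z
        * math.sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials))
        / denom
    )
    # Clamp to [0, 1] to guard against floating-point round-off at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical success count; 2048 trials matches the Figure 3 caption.
    print(wilson_interval(1500, 2048, confidence=0.95))
    # Hypothetical small-sample case with a 68% interval, as in Figure 5.
    print(wilson_interval(7, 10, confidence=0.68))
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for small trial counts or success rates near 0 or 1, which is likely why it suits the small real-world sample sizes.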