Learning Native Continuation for Action Chunking Flow Policies
Yufeng Liu, Hang Yu, Juntu Zhao, Bocheng Li, Di Zhang, Mingzhu Li, Wenxuan Wu, Yingdong Hu, Junyuan Xie, Junliang Guo, Dequan Wang, Yang Gao
TL;DR
This work tackles discontinuities in action chunking for real-time Vision-Language-Action (VLA) policies, caused by inference delay and multimodal action spaces. It introduces Legato, a training-time continuation method that reshapes the flow-based policy dynamics to support per-step, schedule-shaped guidance while keeping training and inference consistent, and uses randomized schedule conditioning to handle varying latencies. The approach yields smoother trajectories, less spurious multimodal switching, shorter task completion times, and robust performance across five real-world manipulation tasks, outperforming both Real-Time Chunking (RTC) and training-time RTC. By making continuation a native property of the policy, Legato enables predictable, efficient chunked control for real-world robotic manipulation across diverse hardware and runtime budgets. Key ideas include the action-noise mixture, a horizon-wise continuation vector, schedule-shaped velocity targets, and exact consistency between training and inference dynamics via a closed-form $f_\theta$.
Abstract
Action chunking enables Vision-Language-Action (VLA) models to run in real time, but naive chunked execution often exhibits discontinuities at chunk boundaries. Real-Time Chunking (RTC) alleviates this issue but is external to the policy, leading to spurious multimodal switching and trajectories that are not intrinsically smooth. We propose Legato, a training-time continuation method for action-chunked, flow-based VLA policies. Specifically, Legato initializes denoising from a schedule-shaped mixture of known actions and noise, exposing the model to partial action information. Moreover, Legato reshapes the learned flow dynamics so that the denoising process remains consistent between training and inference under per-step guidance. Legato further uses randomized schedule conditioning during training to support varying inference delays and achieve controllable smoothness. Empirically, Legato produces smoother trajectories and reduces spurious multimodal switching during execution, leading to less hesitation and shorter task completion times. Extensive real-world experiments show that Legato consistently outperforms RTC across five manipulation tasks, achieving approximately 10% improvements in both trajectory smoothness and task completion time.
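To make the "schedule-shaped mixture of known actions and noise" more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a per-step weight vector that is 1 for steps already committed because of inference delay, decays linearly to 0 over an overlap window, and blends the previous chunk's actions with Gaussian noise accordingly. The linear decay, the function names (`continuation_schedule`, `mixed_initial_state`, `sample_schedule`), and the parameters `delay` and `overlap` are all illustrative assumptions; the abstract does not specify the actual schedule shape or how schedules are randomized.

```python
import numpy as np


def continuation_schedule(horizon: int, delay: int, overlap: int) -> np.ndarray:
    """Hypothetical per-step weights in [0, 1] over the action horizon.

    Steps before `delay` get weight 1 (assumed fully known), the next
    `overlap` steps decay linearly toward 0, and the rest get weight 0.
    """
    steps = np.arange(horizon)
    weights = 1.0 - (steps - delay + 1) / (overlap + 1)
    return np.clip(weights, 0.0, 1.0)


def mixed_initial_state(known_actions: np.ndarray,
                        weights: np.ndarray,
                        rng: np.random.Generator) -> np.ndarray:
    """Blend known actions with Gaussian noise, step by step.

    known_actions has shape (horizon, action_dim); weights has shape (horizon,).
    A weight of 1 keeps the known action, a weight of 0 gives pure noise.
    """
    noise = rng.standard_normal(known_actions.shape)
    w = weights[:, None]  # broadcast over the action dimension
    return w * known_actions + (1.0 - w) * noise


def sample_schedule(horizon: int, rng: np.random.Generator) -> np.ndarray:
    """Assumed form of randomized schedule conditioning: draw a random
    delay and overlap for each training example."""
    delay = int(rng.integers(0, horizon // 2 + 1))
    overlap = int(rng.integers(0, horizon - delay + 1))
    return continuation_schedule(horizon, delay, overlap)


rng = np.random.default_rng(0)
previous_chunk = np.ones((8, 2))            # placeholder known actions
weights = continuation_schedule(8, delay=2, overlap=3)
x_init = mixed_initial_state(previous_chunk, weights, rng)
```

In this sketch the first `delay` rows of `x_init` equal the known actions exactly, while the tail is pure noise. A policy conditioned on `weights` during training would see the same kind of partially known input at inference time, which is the training-inference consistency the abstract describes.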
