Learning Native Continuation for Action Chunking Flow Policies

Yufeng Liu, Hang Yu, Juntu Zhao, Bocheng Li, Di Zhang, Mingzhu Li, Wenxuan Wu, Yingdong Hu, Junyuan Xie, Junliang Guo, Dequan Wang, Yang Gao

TL;DR

This work tackles discontinuities in action chunking for real-time Vision Language Action policies, caused by inference delay and multimodal action spaces. It introduces Legato, a training-time continuation method that reshapes the flow-based policy dynamics to support per-step, schedule-shaped guidance and training-inference consistency, with randomized schedule conditioning to handle varying latencies. The approach yields smoother trajectories, reduced spurious multimodal switching, shorter task completion times, and robust performance across five real-world manipulation tasks, outperforming Real-Time Chunking (RTC) and training-time RTC. By making continuation a native property of the policy, Legato enables predictable, efficient chunked control suitable for diverse hardware and runtime budgets, with practical impact on real-world robotic manipulation. Key ideas include the action-noise mixture, horizon-wise continuation vector $\boldsymbol{\omega}$, schedule-shaped velocity targets, and exact consistency between training and inference dynamics via a closed-form $f_\theta$.

Abstract

Action chunking enables Vision Language Action (VLA) models to run in real time, but naive chunked execution often exhibits discontinuities at chunk boundaries. Real-Time Chunking (RTC) alleviates this issue but is external to the policy, leading to spurious multimodal switching and trajectories that are not intrinsically smooth. We propose Legato, a training-time continuation method for action-chunked flow-based VLA policies. Specifically, Legato initializes denoising from a schedule-shaped mixture of known actions and noise, exposing the model to partial action information. Moreover, Legato reshapes the learned flow dynamics to ensure that the denoising process remains consistent between training and inference under per-step guidance. Legato further uses randomized schedule conditioning during training to support varying inference delays and achieve controllable smoothness. Empirically, Legato produces smoother trajectories and reduces spurious multimodal switching during execution, leading to less hesitation and shorter task completion time. Extensive real-world experiments show that Legato consistently outperforms RTC across five manipulation tasks, achieving approximately 10% improvements in both trajectory smoothness and task completion time.

Paper Structure (54 sections, 23 equations, 7 figures, 8 tables, 1 algorithm)

Figures (7)

  • Figure 1: Legato reduces task completion time while improving trajectory smoothness compared to RTC [black2025real]. Across five real-world manipulation tasks, Legato consistently achieves shorter execution time and lower NSPARC [Balasubramanian2015OnTA] than RTC (lower NSPARC indicates smoother trajectories; see the metrics section). The bottom plot shows an example execution trace on the pour task, as defined in the tasks section, where Legato produces smoother action trajectories with fewer hesitation-induced slowdowns than RTC.
  • Figure 2: Overview of Legato with schedule-shaped continuation dynamics. The schedule parameters are defined as follows: $s$ is the executed length per cycle, $d$ sets the fully guided prefix (inference delay), and $r$ controls the ramp-down length of the guidance schedule over the remaining horizon. Given $\boldsymbol{\omega}$, Legato initializes actions via an action–noise mixture and learns a reshaped velocity field so that the native schedule effect is realized during multi-step denoising.
  • Figure 3: One-shot prefix guidance cannot preserve prefix constraints during denoising. Trajectories show three dimensions of the overlap (prefix) actions across denoising steps; colors indicate diffusion times $t$ (from $1$ to $0$), and GT denotes the ground-truth prefix. Although clamped at initialization, the overlap actions drift from the reference as denoising proceeds, motivating the need for per-step guidance. Evaluated on the pour task, as defined in the tasks section.
  • Figure 4: Real-world evaluation tasks on a dual-arm robot. We consider five manipulation tasks (stack bowls, pour things, pick and place, fold towel and open drawer) covering diverse motion patterns and multimodal choices such as alternative grasp goals and left/right arm selection.
  • Figure 5: Legato suppresses spurious multimodal switching across chunk boundaries. In a representative bowl-stacking rollout, RTC alternates (arrow) between competing grasp goals (green circle) and execution arms (red circle) over successive chunks, producing visibly hesitant corrections. Legato preserves a consistent grasp goal and arm choice (blue circle), leading to steadier progress.
  • ...and 2 more figures
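
The schedule parameters named in the Figure 2 caption ($s$, $d$, $r$, and the horizon-wise vector $\boldsymbol{\omega}$) can be illustrated with a small sketch. This is not the authors' implementation: the linear ramp shape, the linear blend used for the action-noise mixture, the zero-padding of the unknown tail, and the function names `guidance_schedule`, `shift_previous_chunk`, and `mixture_init` are all assumptions chosen for illustration. The captions state only that $d$ sets a fully guided prefix, that $r$ controls the ramp-down length, and that actions are initialized from a mixture of known actions and noise.

```python
import numpy as np


def guidance_schedule(horizon: int, d: int, r: int) -> np.ndarray:
    """Hypothetical horizon-wise weights omega in [0, 1].

    Weight 1 on the first d steps (fully guided prefix), then an assumed
    linear ramp-down over the next r steps, then 0 for the rest.
    """
    idx = np.arange(horizon)
    # Counted from the first step after the prefix, so that step is already < 1.
    ramp = 1.0 - (idx - d + 1) / (r + 1)
    return np.clip(np.where(idx < d, 1.0, ramp), 0.0, 1.0)


def shift_previous_chunk(prev_chunk: np.ndarray, s: int) -> np.ndarray:
    """Drop the s already-executed steps of the previous chunk.

    The tail, where no previous prediction exists, is zero-padded (assumption);
    the schedule should place zero weight there.
    """
    horizon = prev_chunk.shape[0]
    known = np.zeros_like(prev_chunk)
    known[: horizon - s] = prev_chunk[s:]
    return known


def mixture_init(known: np.ndarray, omega: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Assumed linear action-noise mixture used to start denoising."""
    noise = rng.standard_normal(known.shape)
    w = omega[:, None]  # broadcast the per-step weight over action dimensions
    return w * known + (1.0 - w) * noise


horizon, s, d, r = 8, 3, 2, 3
omega = guidance_schedule(horizon, d, r)
prev_chunk = np.arange(horizon * 2, dtype=float).reshape(horizon, 2)
known = shift_previous_chunk(prev_chunk, s)
x_init = mixture_init(known, omega, np.random.default_rng(0))
print(omega)
```

With these illustrative values, the first `d` rows of `x_init` equal the carried-over actions exactly, the next `r` rows blend actions with noise, and the remaining rows are pure noise. Choosing `d + r <= horizon - s` keeps all nonzero weights on steps that the previous chunk actually predicted.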