
Chunk-Boundary Artifact in Action-Chunked Generative Policies: A Noise-Sensitive Failure Mechanism

Rui Wang

Abstract

Action chunking has become a central design choice for generative visuomotor policies, yet the execution discontinuities that arise at chunk boundaries remain poorly understood. In a frozen pretrained action-chunked policy, we identify chunk-boundary artifact as a noise-sensitive failure mechanism. First, artifact is strongly associated with task failure (p < 1e-4, permutation test) and emerges during the rollout rather than only as a post-hoc symptom. Second, under a fixed observation context, changing only latent noise systematically modulates artifact magnitude. Third, by identifying artifact-related directions in noise space and applying trajectory-level steering, we reliably alter artifact magnitude across all evaluated tasks. In hard-task settings with remaining outcome headroom, the success/failure distribution shifts accordingly; on near-ceiling tasks, positive gains are compressed by policy saturation, while the negative causal effect remains visible. Overall, we recast boundary discontinuity from an unavoidable execution nuisance into an analyzable, noise-dominated, and intervenable failure mechanism.
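
The association between artifact magnitude and task failure is reported via a permutation test. A minimal sketch of such a test follows, assuming the statistic is the difference in mean artifact magnitude between failed and successful rollouts. The abstract does not specify the statistic, and the names `permutation_test`, `artifact`, and `failed` are hypothetical.

```python
import numpy as np

def permutation_test(artifact, failed, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference in mean artifact
    magnitude between failed and successful rollouts.

    artifact: per-rollout artifact magnitude (assumed scalar per rollout).
    failed:   per-rollout boolean outcome label (True = task failure).
    The test statistic is an illustrative assumption, not the paper's.
    """
    artifact = np.asarray(artifact, dtype=float)
    failed = np.asarray(failed, dtype=bool)
    rng = np.random.default_rng(seed)

    observed = artifact[failed].mean() - artifact[~failed].mean()

    hits = 0
    for _ in range(n_perm):
        # Shuffle outcome labels while keeping group sizes fixed.
        shuffled = rng.permutation(failed)
        diff = artifact[shuffled].mean() - artifact[~shuffled].mean()
        if abs(diff) >= abs(observed):
            hits += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value
```

With the add-one correction, the smallest attainable p-value is 1 / (`n_perm` + 1), so resolving p < 1e-4 with this estimator requires more than 10^4 permutations.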

Paper Structure (27 sections, 3 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: Matched-horizon action-jerk time courses after truncating success and failure rollouts to a common comparable length. Even under matched horizon, failure trajectories still show stronger jerk pulses around each replanning boundary, indicating that the phenomenon is not driven purely by episode-length differences. Vertical dashed lines mark replanning boundaries every 5 steps.
  • Figure 2: Artifact variation when the observation context is fixed and only the latent noise is changed. Shaded regions indicate the mean $\pm$ standard deviation across noise samples under the same context, and red dots denote reference rollouts. Top: boundary transition jerk. Bottom: local boundary--interior jerk contrast.
  • Figure 3: Artifact-related directions identified on LIBERO-10 task 8. Across 4 contexts, the target artifact metric and the first-boundary jerk contrast vary nearly monotonically with steering strength $\alpha$, indicating that artifact can be stably controlled along specific directions in noise space.
  • Figure 4: Summary of trajectory-level steering. The top row shows the LIBERO-10 ceiling task and the bottom row shows the LIBERO-10 non-ceiling task. Error bars denote 95% confidence intervals: Wilson CIs for success rate and bootstrap CIs for episode boundary--interior jerk contrast. In both settings, targeted-good consistently lowers the boundary--interior jerk contrast and targeted-bad consistently raises it; clear separation in success rate mainly appears in the non-ceiling hard-task setting.
  • Figure A1: Aggregate over the full set of full-trajectory steering experiments included here ($n=158$ per group). Left: success rate. Right: episode boundary--interior jerk contrast. The pool includes ceiling-like, non-ceiling, failure-floor, and one-off settings whose outcome-level interpretation remains uncertain; accordingly, success-rate separation is compressed, while the directional ordering of the mechanistic metric remains visible.
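
The captions above refer to action jerk and to a boundary--interior jerk contrast without defining them. One way such a metric could be computed is sketched below. It assumes that jerk is the norm of the third finite difference of the action sequence and that the contrast is the mean jerk over windows straddling a replanning boundary minus the mean jerk over windows inside a single chunk. The replanning interval of 5 steps comes from the Figure 1 caption; the definitions and the function names are illustrative assumptions, not the paper's.

```python
import numpy as np

def jerk_magnitude(actions):
    """Norm of the third finite difference along time.

    actions: array of shape (T, D). Returns shape (T - 3,).
    Assumed definition of "action jerk"; the paper's may differ.
    """
    actions = np.asarray(actions, dtype=float)
    return np.linalg.norm(np.diff(actions, n=3, axis=0), axis=1)

def boundary_interior_contrast(actions, chunk_len=5):
    """Mean jerk over boundary-straddling windows minus mean jerk over
    interior windows (hypothetical metric definition)."""
    jerk = jerk_magnitude(actions)
    idx = np.arange(jerk.size)
    # jerk[i] is computed from actions[i : i + 4]; that window crosses a
    # chunk boundary when its last step falls in the next chunk.
    straddles = (idx % chunk_len) + 3 >= chunk_len
    if not straddles.any() or straddles.all():
        raise ValueError("need both boundary and interior windows")
    return jerk[straddles].mean() - jerk[~straddles].mean()

# Toy rollout: each chunk holds a constant action, with a jump at every boundary.
stepped = np.repeat(np.arange(4.0), 5).reshape(-1, 1)
# Toy rollout: a single linear ramp with no discontinuity.
smooth = np.arange(20.0).reshape(-1, 1)

contrast_stepped = boundary_interior_contrast(stepped)  # positive
contrast_smooth = boundary_interior_contrast(smooth)    # zero
```

Under this definition a third-difference window spans four steps, so `chunk_len` must be at least 4 for any interior window to exist; with the 5-step interval, two of every five windows are interior.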