VLA Knows Its Limits

Haoxuan Wang, Gengyu Zhang, Yan Yan, Ramana Rao Kompella, Gaowen Liu

TL;DR

This work interprets action self-attention weights as a proxy for the model's predictive limit and proposes AutoHorizon, the first test-time method that dynamically estimates the execution horizon for each predicted action chunk to adapt to changing perceptual conditions.

Abstract

Action chunking has recently emerged as a standard practice in flow-based Vision-Language-Action (VLA) models. However, the effect and choice of the execution horizon (the number of actions to be executed from each predicted chunk) remain underexplored. In this work, we first show that varying the execution horizon leads to substantial performance deviations, with performance initially improving and then declining as the horizon increases. To uncover the reasons, we analyze the cross- and self-attention weights in flow-based VLAs and reveal two key phenomena: (i) intra-chunk actions attend invariantly to vision-language tokens, limiting adaptability to environmental changes; and (ii) the initial and terminal action tokens serve as stable anchors, forming latent centers around which intermediate actions are organized. Motivated by these insights, we interpret action self-attention weights as a proxy for the model's predictive limit and propose AutoHorizon, the first test-time method that dynamically estimates the execution horizon for each predicted action chunk to adapt to changing perceptual conditions. Across simulated and real-world robotic manipulation tasks, AutoHorizon is performant, incurs negligible computational overhead, and generalizes across diverse tasks and flow-based models.
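
To make the role of the execution horizon concrete, the following is a minimal sketch of a chunked rollout loop: the policy predicts a chunk of actions, only the first few are executed, and the policy is queried again from the new observation. Everything here is an assumption for illustration: `toy_policy`, the 1-D dynamics, the value of `PREDICTION_HORIZON`, and the placeholder horizon rule are hypothetical and are not the paper's AutoHorizon estimator, which derives the horizon from attention weights.

```python
from typing import Callable, List, Tuple

PREDICTION_HORIZON = 8  # actions per predicted chunk (illustrative value, not from the paper)


def toy_policy(obs: float, target: float = 10.0) -> List[float]:
    """Hypothetical stand-in for a VLA: predicts equal steps toward a target."""
    step = (target - obs) / PREDICTION_HORIZON
    return [step] * PREDICTION_HORIZON


def rollout(
    choose_horizon: Callable[[List[float]], int],
    total_steps: int,
    obs: float = 0.0,
) -> Tuple[float, List[int]]:
    """Execute the first e actions of each chunk, then re-plan from the new observation."""
    horizons: List[int] = []
    steps_done = 0
    while steps_done < total_steps:
        chunk = toy_policy(obs)        # one policy call per chunk
        e = choose_horizon(chunk)      # execution horizon for this chunk
        # Clamp to a valid range: at least 1, at most the chunk length and the remaining budget.
        e = max(1, min(e, len(chunk), total_steps - steps_done))
        for action in chunk[:e]:       # open-loop execution of e actions
            obs += action              # toy dynamics: state += action
        horizons.append(e)
        steps_done += e
    return obs, horizons


# Conventional chunking: the same fixed e for every chunk.
final_obs, fixed = rollout(lambda chunk: 4, total_steps=12)

# A per-chunk horizon plugs in at the same place. This rule is a placeholder only.
_, varying = rollout(lambda chunk: 2 if abs(chunk[0]) > 1.0 else 6, total_steps=12)
```

A shorter horizon means more policy calls and more frequent re-planning; a longer one executes more of each chunk open-loop. The sketch only shows where that choice enters the loop, not how it should be made.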

Paper Structure

This paper contains 26 sections, 1 theorem, 14 equations, 10 figures, 10 tables, and 1 algorithm.

Key Result

Proposition 1

Suppose $\delta^{d}_{j}(e)$ is a monotonically increasing function with respect to $e$ and can be modeled as $\delta^{d}_{j}(e) = k e \log e$, where $k > 0$ is a scaling factor. Denote $p$ as the prediction horizon. Assuming $L$ is divisible by $e$, there exists a unique minimizer $e^{*}$ of $\mathcal{L}(e)$ such that $\mathcal{L}(e)$ is strictly decreasing on $(0, e^{*})$ and strictly increasing on $(e^{*}, \ldots)$ ...
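
As a quick sanity check on the assumed error model (a side derivation, not taken from the paper, and assuming $\log$ is the natural logarithm), the form $k e \log e$ is indeed increasing over every admissible execution horizon $e \geq 1$:

$$
\frac{d}{de}\bigl(k e \log e\bigr) = k\,(\log e + 1) > 0 \quad \text{for } e > \exp(-1),
\qquad
\frac{d^{2}}{de^{2}}\bigl(k e \log e\bigr) = \frac{k}{e} > 0 \quad \text{for } e > 0 .
$$

So the modeled error is zero at $e = 1$, and strictly increasing and convex for $e \geq 1$, which is consistent with the monotonicity assumption in the proposition.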

Figures (10)

  • Figure 1: Illustration of the average success rates on the LIBERO benchmark using $\pi_{0.5}$. Varying the execution horizon leads to substantial success rate fluctuations, and the policy performance exhibits a peaked pattern, initially improving and then declining as the execution horizon increases.
  • Figure 2: Left: (a) In conventional action chunking, the execution horizon $e$ is heuristically chosen by humans and remains fixed across chunks. (b) In contrast, AutoHorizon (our method) dynamically estimates the execution horizon for each predicted chunk based on the attention weights from the VLA model. Right: Real-world demonstration showing how the estimated execution horizons evolve during policy rollout. When the environment is stable and reactivity is less critical (e.g., reaching the cube or moving toward the bowl), the estimated horizon increases to promote smooth, stable motion. Conversely, during physical interaction (e.g., grasping or placing the cube), the execution horizon shortens to enhance reactivity and adaptability.
  • Figure 3: Visualization of average attention weights in $\pi_{0.5}$ across different stages of task execution. Intra-chunk actions consistently attend to the same vision and language tokens across predicted chunks throughout the rollout. This invariance is consistently observed across different sampling steps, task rollouts, and pretrained models. The x-axis is rescaled for clarity of visualization.
  • Figure 4: Visualization of normalized action self-attention weights. Across different prediction horizons, the predicted actions exhibit strong attention to the initial and terminal action tokens, with correspondence strength remaining high before sharply decaying as temporal distance increases. These boundary tokens (encircled in black) are referred to as radial action sinks.
  • Figure 5: Estimated execution horizon distributions by AutoHorizon. The legend displays the mean values of the distributions.
  • ...and 5 more figures

Theorems & Definitions (1)

  • Proposition 1: Unique Error Minimizer