Mixture of Horizons in Action Chunking

Dong Jing, Gang Wang, Jiaqi Liu, Weiliang Tang, Zelong Sun, Yunchao Yao, Zhenyu Wei, Yunhui Liu, Zhiwu Lu, Mingyu Ding

TL;DR

This work addresses horizon sensitivity in vision-language-action robotic policies, where the horizon length $H$ trades long-term foresight against short-term precision. It introduces Mixture of Horizons (MoH), which processes horizon-specific chunks $A_t^{(h)}$ in parallel with a shared action transformer and fuses them through a light gating head, augmented by a horizon-balance regularizer. Dynamic inference via cross-horizon consensus further improves stability and efficiency, achieving up to $2.5\times$ higher throughput than baselines and strong performance on LIBERO and RoboTwin. Empirically, MoH yields consistent gains for both flow-matching and one-step regression policies, reaching $99\%$ average LIBERO success with $\pi_{0.5}$ under mixed-task training. The approach is plug-and-play and computationally lightweight, improving generalization with minimal overhead.

Abstract

Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $\textbf{action chunk length}$ used during training, termed the $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying that any fixed choice of a single horizon is suboptimal. To mitigate the trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses the outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving $2.5\times$ higher throughput than baselines while preserving superior performance. Extensive experiments on the flow-based policies $\pi_0$ and $\pi_{0.5}$ and the one-step regression policy $\pi_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains in both simulation and real-world tasks. Notably, under the mixed-task setting, $\pi_{0.5}$ with MoH reaches a new state of the art with a $99\%$ average success rate on LIBERO after only $30k$ training iterations. Project page: https://github.com/Timsty1/MixtureOfHorizons
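
The third benefit, selecting stable actions through cross-horizon consensus, can be illustrated with a small sketch. The abstract does not state the exact consensus criterion, so everything specific here is an assumption made for illustration: the function name `consensus_prefix_length`, the spread measure (largest distance from the mean prediction), and the knobs `tol` and `min_steps` are hypothetical and not taken from the paper. The idea shown is only that the policy executes the leading steps on which the horizon-specific predictions agree, and replans once they diverge.

```python
import numpy as np

def consensus_prefix_length(preds, horizons, tol=0.05, min_steps=1):
    """Return how many leading steps of the chunk to execute.

    preds:    array (num_horizons, max_h, action_dim); entries past a
              horizon's own length are ignored.
    horizons: horizon length of each row in preds.
    tol, min_steps: hypothetical knobs, not values from the paper.
    """
    max_h = preds.shape[1]
    n = 0
    for k in range(max_h):
        # Horizons that still predict an action at step k.
        active = [i for i, h in enumerate(horizons) if h > k]
        if len(active) < 2:
            break  # fewer than two opinions left, so no consensus to check
        step = preds[active, k]  # (n_active, action_dim)
        # Assumed disagreement measure: largest distance from the mean.
        spread = np.max(np.linalg.norm(step - step.mean(axis=0), axis=1))
        if spread > tol:
            break
        n = k + 1
    return max(n, min_steps)

# Three horizons that agree on the first three steps and diverge on the fourth.
preds = np.zeros((3, 6, 2))
preds[2, 3] = 1.0
steps_to_run = consensus_prefix_length(preds, horizons=[2, 4, 6])
```

Under these assumptions, executing a longer prefix whenever the horizons agree means fewer policy calls per episode, which is one plausible reading of where the reported throughput gain comes from.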

Paper Structure

This paper contains 23 sections, 17 equations, 14 figures, 5 tables, 1 algorithm.

Figures (14)

  • Figure 1: Effect of action horizon on $\pi_0$. The first $5$ actions in the predicted chunk are executed at evaluation. Varying horizons lead to trade-off effects across four LIBERO task suites. Our mixture of horizons strategy alleviates this trade-off and raises overall success.
  • Figure 2: Overview of the proposed mixture of horizons strategy, which integrates action chunks of multiple horizons via the shared action transformer and a mixture gating mechanism.
  • Figure 3: Overview of our mixture of horizons framework. The action-related input is rearranged into different horizons and then processed in parallel by a shared action transformer. A linear gate head, with only $2k$ parameters, produces per-step, per-horizon weights to fuse horizon-wise predictions into the final action predictions. This strategy is plug-and-play for any full-attention action transformer, including both flow-matching and one-step policies.
  • Figure 4: Comparisons with state-of-the-art methods on RoboTwin 2.0 Benchmark.
  • Figure 5: Visualization of the horizon weights of $\pi_{0.5}$ with MoH on the LIBERO-Long task suite. The regularization term $L_{bal}$ encourages a balanced weight distribution across horizons. Without $L_{bal}$, the gating weights show a clear preference for particular horizons at all times. The weights of $H3$ drop to $0$ at steps $4$ and $5$ because it is no longer active.
  • ...and 9 more figures
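
The fusion described in the captions of Figures 3 and 5 (per-step, per-horizon weights, with an inactive horizon receiving zero weight) can be sketched as a masked softmax over horizons. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fuse_horizons` is hypothetical, the gate logits are passed in directly rather than produced by the linear gate head, and $H3$ is read here as a horizon covering three steps.

```python
import numpy as np

def fuse_horizons(preds, gate_logits, horizons):
    """Fuse horizon-wise predictions with per-step, per-horizon weights.

    preds:       (K, H, D) predictions from K horizons, padded to length H.
    gate_logits: (K, H) gate scores (assumed given; hypothetical input).
    horizons:    K horizon lengths; the longest must equal H so every
                 step has at least one active horizon.
    """
    K, H, _ = preds.shape
    steps = np.arange(H)[None, :]                    # (1, H)
    active = steps < np.asarray(horizons)[:, None]   # (K, H) validity mask
    # Inactive horizons get -inf so their softmax weight is exactly 0.
    logits = np.where(active, gate_logits, -np.inf)
    logits = logits - logits.max(axis=0, keepdims=True)
    w = np.exp(logits)
    w = w / w.sum(axis=0, keepdims=True)             # softmax over horizons
    masked_preds = np.where(active[:, :, None], preds, 0.0)
    fused = (w[:, :, None] * masked_preds).sum(axis=0)  # (H, D)
    return fused, w

# Two horizons of length 3 and 5 with equal gate scores.
preds = np.stack([np.full((5, 1), 1.0), np.full((5, 1), 3.0)])
fused, weights = fuse_horizons(preds, np.zeros((2, 5)), horizons=[3, 5])
```

In this toy case the first three steps average the two horizons, while steps 4 and 5 rely entirely on the longer horizon, mirroring the behaviour the Figure 5 caption reports for $H3$.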