Mixture of Horizons in Action Chunking

Dong Jing; Gang Wang; Jiaqi Liu; Weiliang Tang; Zelong Sun; Yunchao Yao; Zhenyu Wei; Yunhui Liu; Zhiwu Lu; Mingyu Ding


TL;DR

This work addresses horizon sensitivity in vision-language-action robotic policies, where the horizon length $H$ trades off long-term foresight against short-term precision. It introduces Mixture of Horizons (MoH), which processes horizon-specific chunks $A_t^{(h)}$ in parallel with a shared action transformer and fuses them through a lightweight gating head, augmented by a horizon-balance regularizer. Dynamic inference via cross-horizon consensus further improves stability and efficiency, achieving up to $2.5\times$ throughput and strong performance on LIBERO and RoboTwin. Empirically, MoH yields consistent gains for both flow-matching and one-step regression policies, reaching $99\%$ average LIBERO success with $\pi_{0.5}$ under mixed-task training. The approach is plug-and-play and computationally lightweight, improving generalization with minimal overhead.
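One way to read the gated fusion, offered as an assumption rather than as the paper's stated formula: let $\mathcal{H}$ be the set of horizons, $a_{t+k}^{(h)}$ the action that chunk $A_t^{(h)}$ predicts for step $k$ (with $0 \le k < h$), and $g_k^{(h)}$ hypothetical gate weights produced by the gating head. The fused action at step $k$ could then be a convex combination over the horizons long enough to reach that step:

$$\hat{a}_{t+k} = \sum_{h \in \mathcal{H},\, h > k} g_k^{(h)}\, a_{t+k}^{(h)}, \qquad g_k^{(h)} \ge 0, \qquad \sum_{h \in \mathcal{H},\, h > k} g_k^{(h)} = 1.$$

Under this reading, early steps are covered by every horizon while late steps are covered only by the longer ones. The symbols $\mathcal{H}$, $a_{t+k}^{(h)}$, $g_k^{(h)}$, and $\hat{a}_{t+k}$ are introduced here for illustration only.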

Abstract

Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $\textbf{action chunk length}$ used during training, termed the $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying that any fixed single horizon is suboptimal. To mitigate this trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses the outputs with a lightweight linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules, with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments on the flow-based policies $π_0$ and $π_{0.5}$ and the one-step regression policy $π_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains on both simulated and real-world tasks. Notably, under the mixed-task setting, $π_{0.5}$ with MoH reaches a new state of the art with a 99$\%$ average success rate on LIBERO after only $30k$ training iterations. Project page: https://github.com/Timsty1/MixtureOfHorizons
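The abstract does not spell out how cross-horizon consensus selects stable actions, so the following is only a minimal sketch under stated assumptions, not the authors' procedure. It assumes each horizon $h$ yields a chunk of $h$ actions, measures agreement at each step as the largest deviation from the mean action across the horizons covering that step, and executes the longest agreeing prefix. The function name `consensus_prefix_length`, the threshold `tol`, and the `min_steps` fallback are all hypothetical.

```python
import numpy as np

def consensus_prefix_length(chunks, tol=0.05, min_steps=1):
    """Count the leading steps on which the horizons agree.

    chunks:    dict mapping horizon h -> array of shape (h, action_dim)
               (hypothetical layout of per-horizon predictions).
    tol:       hypothetical threshold on the largest deviation from the
               mean action across horizons at one step.
    min_steps: always execute at least this many actions.
    """
    n_agree = 0
    for k in range(max(chunks)):
        # Only horizons long enough to cover step k can vote on it.
        votes = np.stack([a[k] for h, a in chunks.items() if h > k])
        if len(votes) < 2:
            break  # a single horizon cannot form a consensus
        spread = np.abs(votes - votes.mean(axis=0)).max()
        if spread > tol:
            break  # horizons disagree: stop executing here
        n_agree += 1
    return max(n_agree, min_steps)

# Toy data: three horizons that agree until step 3, where the longest
# horizon deviates from the others.
base = np.linspace(0.0, 1.0, 8)[:, None] * np.ones((8, 2))
chunks = {2: base[:2], 4: base[:4], 8: base[:8].copy()}
chunks[8][3] += 0.5
n = consensus_prefix_length(chunks)
print(n)
```

In this sketch, more actions are executed per model call when the horizons agree, which is one plausible route to the higher throughput the abstract reports. The actual agreement criterion and thresholds may differ.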

