Temporal Action Selection for Action Chunking
Yueyang Weng, Xiaopeng Zhang, Yongjin Mu, Yingcong Zhu, Yanjie Li, Qi Liu
TL;DR
The paper addresses the reactivity-versus-consistency trade-off in action chunking for learning from demonstration. It introduces Temporal Action Selector (TAS), which caches action chunks predicted at multiple timesteps and selects among them with a lightweight selector that operates in latent space using cosine similarity. TAS is trained with online RL (PPO) using sparse rewards and a coherence penalty, and can be combined with residual RL in either a frozen or a jointly optimized mode, improving base-policy performance without catastrophic forgetting. Across PushT and FurnitureBench tasks with diverse base policies and noise settings, TAS yields substantial success-rate gains over baselines and shows strong sim-to-real transfer in real-world one_leg experiments. The approach balances reactivity, decision consistency, and motion coherence, and improves RL training efficiency when used as the base policy for residual RL.
Abstract
Action chunking is a widely adopted approach in Learning from Demonstration (LfD). By modeling multi-step action chunks rather than single-step actions, it significantly improves the ability to model human expert policies. However, the reduced decision frequency limits the use of recent observations and degrades reactivity, which is particularly evident in poor adaptation to sensor noise and dynamic environmental changes. Existing efforts to address this issue have primarily traded reactivity off against decision consistency, without achieving both. To address this limitation, we propose a novel algorithm, Temporal Action Selector (TAS), which caches predicted action chunks from multiple timesteps and dynamically selects the optimal action through a lightweight selector network. TAS achieves balanced optimization across three critical dimensions: reactivity, decision consistency, and motion coherence. Experiments across multiple tasks with diverse base policies show that TAS significantly improves success rates, yielding an absolute gain of up to 73.3%. Furthermore, using TAS as the base policy for residual reinforcement learning (RL) substantially enhances training efficiency and raises the performance plateau. Experiments both in simulation and on physical robots confirm the method's efficacy.
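
To make the cache-and-select idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a fixed chunk horizon, that each cached chunk proposes one action for the current timestep (indexed by the chunk's age), and that a selector scores candidates by cosine similarity between a query embedding and one key embedding per candidate. The names `ChunkCache`, `select_action`, `query`, and `keys` are hypothetical, and the embeddings here are hand-set placeholders standing in for the outputs of a learned selector network.

```python
from collections import deque

import numpy as np


def cosine_similarity(query, keys, eps=1e-8):
    """Cosine similarity between one query vector and each row of `keys`."""
    q = query / (np.linalg.norm(query) + eps)
    k = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + eps)
    return k @ q


class ChunkCache:
    """Stores recent action chunks with the timestep at which each was predicted.

    Hypothetical helper: a chunk has shape (horizon, action_dim), and row i is
    the action proposed for timestep t_pred + i.
    """

    def __init__(self, horizon):
        self.horizon = horizon
        # Chunks older than `horizon` steps no longer cover the current step.
        self.entries = deque(maxlen=horizon)

    def add(self, t_pred, chunk):
        chunk = np.asarray(chunk, dtype=float)
        if chunk.shape[0] != self.horizon:
            raise ValueError("chunk length must equal the horizon")
        self.entries.append((t_pred, chunk))

    def candidates(self, t):
        """Return (ages, actions): what each cached chunk proposes for step t."""
        ages, actions = [], []
        for t_pred, chunk in self.entries:
            age = t - t_pred
            if 0 <= age < self.horizon:
                ages.append(age)
                actions.append(chunk[age])
        if not actions:
            raise ValueError("no cached chunk covers this timestep")
        return np.array(ages), np.stack(actions)


def select_action(query, keys, actions):
    """Pick the candidate whose key is most similar to the query (assumed rule)."""
    scores = cosine_similarity(query, keys)
    best = int(np.argmax(scores))
    return actions[best], best, scores


# Usage: three chunks predicted at t = 0, 1, 2; choose an action for t = 2.
cache = ChunkCache(horizon=4)
for t_pred in range(3):
    # Dummy chunk: value encodes (prediction step) + 0.1 * (offset in chunk).
    chunk = np.full((4, 2), float(t_pred)) + 0.1 * np.arange(4)[:, None]
    cache.add(t_pred, chunk)

ages, actions = cache.candidates(t=2)
keys = np.eye(3)                    # placeholder key embeddings, one per candidate
query = np.array([0.0, 1.0, 0.1])   # placeholder query embedding
action, best, scores = select_action(query, keys, actions)
print("ages:", ages, "chosen candidate:", best, "action:", action)
```

In this sketch the freshest chunk (age 0) favors reactivity, while older chunks favor consistency with earlier decisions; the selector chooses among them at every step. How the embeddings are produced, and how the PPO reward and coherence penalty shape them, is left to the paper and not modeled here.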
