SEAR: Sample Efficient Action Chunking Reinforcement Learning

C. F. Maximilian Nagy, Onur Celik, Emiliyan Gospodinov, Florian Seligmann, Weiran Liao, Aryan Kaushik, Gerhard Neumann

TL;DR

SEAR is an off-policy online reinforcement learning algorithm for action chunking. It exploits the temporal structure of action chunks and operates with a receding horizon, effectively combining the benefits of small and large chunk sizes.

Abstract

Action chunking can improve exploration and value estimation in long-horizon reinforcement learning, but it makes learning substantially harder: the critic must evaluate action sequences rather than single actions, which greatly increases approximation and data efficiency challenges. As a result, existing action chunking methods, primarily designed for the offline and offline-to-online settings, have not achieved strong performance in purely online reinforcement learning. We introduce SEAR, an off-policy online reinforcement learning algorithm for action chunking. It exploits the temporal structure of action chunks and operates with a receding horizon, effectively combining the benefits of small and large chunk sizes. SEAR outperforms state-of-the-art online reinforcement learning methods on Metaworld, training with chunk sizes up to 20.

Paper Structure (30 sections, 9 equations, 9 figures, 4 tables, 1 algorithm)

Figures (9)

  • Figure 1: Action chunking increases sample efficiency in online reinforcement learning. The figure shows aggregated performance on 20 hard Metaworld environments. Naively applying action chunking to a state-of-the-art RL method (SimbaV2 (Chunked)) degrades performance relative to the single-step policy (SimbaV2) [lee2025simbav2]. A recent action chunking method designed for the offline-to-online setting (CQN-AS) [seo2024coarse] fails to improve the success rate over the single-step policy. SEAR (with chunk size $N=10$) improves performance significantly while being more sample efficient, showing that combining a transformer critic, multi-horizon targets, and random replanning yields efficient and stable training of action chunking policies.
  • Figure 2: Overview of SEAR's (a) critic and (b) actor updates. The critic takes the state $s_t$ and an action chunk $a_{t:t+N-1}$ of chunk size $N$ as arguments. During the critic update (a), these training points are sampled from the replay buffer, and the causal transformer critic predicts Q-values $Q^{(1)}(s_t, a_t), Q^{(2)}(s_t, a_{t:t+1}), \dots, Q^{(N)}(s_t, a_{t:t+N-1})$ for all action chunk prefixes $a_{t:t+n}$, $n \in \{0, \dots, N-1\}$. These multi-horizon predictions increase the amount of training data available for the subsequent critic update. Subfigure (b) visualizes the actor update. Given a state $s_t$, the actor predicts an action chunk $a_{t:t+N-1}$ of size $N$, which is passed to the causal transformer to obtain the Q-value prediction $Q^{(N)}(s_t, a_{t:t+N-1})$; this prediction is then used to update the actor's parameters under the maximum entropy RL objective.
  • Figure 3: Improved state coverage by randomized replanning. The figure visualizes trajectories of point-mass agents moving left to right. All agents employ action chunking with chunk size $N=4$. The states in which the policy is evaluated are marked with rectangles. In the top row, all actions sampled from the action chunking policy are executed, so the policy is only evaluated on a subset of the state space (rectangles). If a receding horizon is used with such a policy, it will be evaluated out of distribution, resulting in subpar performance. Instead, we propose to execute only a random prefix of each action chunk (bottom row), which leads to more diverse coverage of the states on which the policy is evaluated.
  • Figure 4: Performance analysis of design choices (a) and receding horizon (b). (a) compares SEAR-10, as the base configuration, against variants that discard individual design choices. Naive Chunking without SEAR's features does not yield good performance, and discarding Multi-Horizon targets leads to the largest performance drop. Replacing SEAR's transformer critic with an MLP Critic while keeping the other features is better than discarding multi-horizon targets, but still causes a significant performance drop. Discarding only Random Replanning yields performance closest to the standard one-step RL policy (No Action Chunking), but still below SEAR's base performance. (b) analyzes the effect of receding horizon when SEAR's key algorithmic features are discarded. Receding horizon improves performance for smaller chunk sizes for the base configuration SEAR-10, the MLP Critic, and the action chunking baseline CQN-AS, but it does not seem to have a large effect when multi-horizon targets are discarded or when random replanning is disabled. Receding horizon seems to harm the naive chunking policy's performance.
  • Figure 5: Varying chunk sizes and their effect on sample efficiency (a) and receding horizon performance (b). An action chunk size of $N=1$ (SEAR-1) matches the performance of SimbaV2 [lee2025simbav2] on Metaworld's 20 hardest tasks, while a chunk size of $N=10$ (SEAR-10) appears to be a sweet spot, outperforming SEAR-5 and SEAR-20 (a). (b) shows SEAR's performance as a function of the replanning interval $k$ for varying chunk sizes. A replanning interval of $k=4$ gives the best performance regardless of the chunk size $N$. Policies trained with larger chunk sizes and evaluated with a receding horizon outperform policies trained with smaller chunk sizes. For example, SEAR-20 (chunk size $N=20$) performs better than SEAR-10 at a replanning interval of $k=10$.
  • ...and 4 more figures
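
The prefix structure described for Figure 2 and the random-prefix execution described for Figure 3 can be illustrated with a small toy sketch. This is not the authors' implementation: the function names `chunk_prefixes` and `query_steps` are hypothetical, the environment is reduced to a time index, and drawing the executed prefix length uniformly from $1, \dots, N$ is an assumption (the captions only say "a random prefix").

```python
import random


def chunk_prefixes(chunk):
    """Return all prefixes a_{t:t+n}, n = 0..N-1, of an action chunk.

    One sampled chunk of size N yields N (state, prefix) inputs, which is
    the sense in which multi-horizon predictions add training data.
    """
    return [chunk[: n + 1] for n in range(len(chunk))]


def query_steps(horizon, chunk_size, rng=None):
    """Time steps at which a chunked policy is queried in one episode.

    rng is None: every chunk is executed in full (Figure 3, top row).
    rng given:   only a random prefix is executed before replanning
                 (Figure 3, bottom row). The uniform prefix length is an
                 assumption made for this sketch.
    """
    steps, t = [], 0
    while t < horizon:
        steps.append(t)  # the policy is evaluated at this step
        executed = chunk_size if rng is None else rng.randint(1, chunk_size)
        t += executed    # skip ahead by the number of executed actions
    return steps


# Full execution with N=4 only ever queries the policy at multiples of 4.
full = query_steps(horizon=16, chunk_size=4)
print(full)  # [0, 4, 8, 12]

# Random prefixes: collect the queried steps over many episodes.
rng = random.Random(0)
covered = set()
for _ in range(200):
    covered.update(query_steps(horizon=16, chunk_size=4, rng=rng))
print(sorted(covered))

# Prefixes of a toy chunk of three actions.
print(chunk_prefixes(["a0", "a1", "a2"]))
```

In this toy setting, full execution only ever evaluates the policy at a fixed subset of time steps, whereas random prefixes spread the evaluation points across the episode. That spread is what lets a policy trained with a long chunk be replanned at shorter intervals without being queried at points it never saw during training.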