Reinforcement Learning with Action Chunking

Qiyang Li, Zhiyuan Zhou, Sergey Levine

TL;DR

Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to leverage temporally consistent behaviors from offline data for more effective online exploration and to use unbiased $n$-step backups for more stable and efficient TD learning.

Abstract

We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
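As a rough illustration of point (2), the sketch below computes a TD target for a critic that scores a whole action chunk rather than a single action. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `chunked_td_target`, the scalar `next_q` input, and the `done` flag are all hypothetical, and the chunk length $h$ is simply the number of rewards passed in. One reading of why no off-policy correction appears is that the critic is conditioned on the same $h$ actions that produced the $h$ rewards.

```python
import numpy as np


def chunked_td_target(rewards, next_q, gamma, done=False):
    """h-step TD target for a critic defined over action chunks (illustrative).

    rewards : the h rewards r_t, ..., r_{t+h-1} observed while the chunk
              a_t, ..., a_{t+h-1} was executed.
    next_q  : assumed critic estimate Q(s_{t+h}, next chunk); hypothetical input.
    gamma   : discount factor.
    done    : True if the episode terminated inside the chunk.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    h = rewards.shape[0]
    discounts = gamma ** np.arange(h)  # 1, gamma, ..., gamma^(h-1)
    n_step_return = float(np.sum(discounts * rewards))
    # Bootstrap from the critic h steps ahead unless the episode has ended.
    bootstrap = 0.0 if done else (gamma ** h) * next_q
    return n_step_return + bootstrap


# Sparse-reward example: the only reward arrives on the last step of the chunk.
target = chunked_td_target([0.0, 0.0, 1.0], next_q=2.0, gamma=0.5)
terminal_target = chunked_td_target([0.0, 0.0, 1.0], next_q=2.0, gamma=0.5, done=True)
```

With a sparse reward, the bootstrap term carries value across $h$ environment steps in a single backup, which is the intuition behind faster value propagation on long-horizon tasks.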

Paper Structure

This paper contains 34 sections, 1 theorem, 25 equations, 16 figures, 7 tables, 3 algorithms.

Key Result

Proposition A.1

Let $s_t, a_t, \cdots, s_{t+n}$ be a trajectory segment generated by following a data collection policy $\pi_\beta(a_t, \cdots, a_{t+n} \mid s_t)$ (i.e., $s_{t+k} \sim T(\cdot \mid s_{t+k-1}, a_{t+k-1}), \forall k\in\{1, \cdots, n\}$), and let $r_t, r_{t+1}, \cdots, r_{t+n-1}$ be the rewards received at each

Figures (16)

  • Figure 2: Naïvely using action chunking for online RL with Gaussian policies leads to poor performance. (1) RLPD runs online RL on both offline data and an online replay buffer (Ball et al., 2023). (2) RLPD-AC is the same algorithm as RLPD but operates in a temporally extended action space (action chunk size of 5). (3) QC-RLPD additionally uses a behavior cloning loss on the actor (4 seeds).
  • Figure 3: Robomimic results. QC achieves strong performance across all three robomimic tasks. The first 1M steps are offline and the next 1M steps are online, with one environment step per training step (5 seeds).
  • Figure 4: $n$-step return ablations on robomimic. Both Q-chunking methods consistently outperform their $n$-step and 1-step TD counterparts (5 seeds).
  • Figure 5: End-effector movements early in training and temporal coherency analysis on cube-triple-task3. Left: QC covers a more diverse set of states than BFN in the first 1000 environment steps. Right: QC exhibits higher temporal coherency in end-effector movements than BFN.
  • Figure 6: Sensitivity analysis: action chunk size ($h$), critic ensemble size ($K$), and update-to-data ratio (UTD). Left: QC-FQL with different $h$ on all 5 cube-triple tasks (5 seeds). QC-FQL with $h=1$ is equivalent to FQL. Center: Increasing the ensemble size to $K=10$ improves the performance of both QC and BFN on cube-triple-task3 (5 seeds). Right: QC with a UTD of 5 on cube-triple-task3 (5 seeds). We report only the online-phase results, as all methods achieve near-zero success rates during the offline phase.
  • ...and 11 more figures
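
The "temporally extended action space (action chunk size of 5)" mentioned in the Figure 2 caption can be pictured with a small rollout loop: the policy is queried once per chunk, and the resulting actions are executed open-loop. The sketch below is illustrative only. `ToyEnv`, `chunk_policy`, and `rollout_chunked` are hypothetical names, the environment is a made-up 1-D task with a sparse goal reward, and the policy is a constant placeholder rather than a learned actor.

```python
import numpy as np


class ToyEnv:
    """Hypothetical 1-D environment: reward 1 once the position reaches the goal."""

    def __init__(self, goal=5.0):
        self.goal = goal
        self.pos = 0.0

    def reset(self):
        self.pos = 0.0
        return self.pos

    def step(self, action):
        self.pos += float(action)
        reward = 1.0 if self.pos >= self.goal else 0.0
        return self.pos, reward, reward > 0.0  # (next state, reward, done)


def chunk_policy(state, h):
    """Placeholder policy: returns h actions at once (here, constant unit steps)."""
    return np.ones(h)


def rollout_chunked(env, policy, h, max_steps):
    """Query the policy once per chunk and execute its h actions open-loop."""
    state = env.reset()
    transitions, steps, done = [], 0, False
    while steps < max_steps and not done:
        chunk = policy(state, h)
        start_state, rewards = state, []
        for action in chunk:
            state, reward, done = env.step(action)
            rewards.append(reward)
            steps += 1
            if done or steps >= max_steps:
                break
        # One chunk-level transition: (s_t, chunk, per-step rewards, s_{t+h}, done)
        transitions.append((start_state, chunk, rewards, state, done))
    return transitions


transitions = rollout_chunked(ToyEnv(), chunk_policy, h=5, max_steps=20)
```

Each stored tuple pairs a start state with the full chunk and its per-step rewards, which is the kind of record a chunk-level critic would be trained on in this simplified picture.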

Theorems & Definitions (2)

  • Proposition A.1: Q-chunking performs unbiased $n$-step return backup
  • proof