Chunk-Guided Q-Learning

Gwanwoo Song, Kwanyoung Park, Youngwoon Lee

Abstract

In offline reinforcement learning (RL), single-step temporal-difference (TD) learning can suffer from bootstrapping error accumulation over long horizons. Action-chunked TD methods mitigate this by backing up over multiple steps, but can introduce suboptimality by restricting the policy class to open-loop action sequences. To resolve this trade-off, we present Chunk-Guided Q-Learning (CGQ), a single-step TD algorithm that guides a fine-grained single-step critic by regularizing it toward a chunk-based critic trained using temporally extended backups. This reduces compounding error while preserving fine-grained value propagation. We theoretically show that CGQ attains tighter critic optimality bounds than either single-step or action-chunked TD learning alone. Empirically, CGQ achieves strong performance on challenging long-horizon OGBench tasks, often outperforming both single-step and action-chunked methods.
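To make the abstract's description concrete, the following is a minimal sketch of how a single-step TD loss might be combined with a pull toward a chunk-based critic. It is not the paper's implementation. The squared-distance penalty, the function names, and the use of a single scalar `q_chunk` as the guidance value are all assumptions made for illustration. The weight `beta` is named after the $\beta$ in Figure 5, on the assumption (not confirmed by this summary) that $\beta$ is the regularization coefficient.

```python
def one_step_target(reward, gamma, next_value):
    """Standard single-step TD target: r + gamma * V(s')."""
    return reward + gamma * next_value


def chunk_target(rewards, gamma, value_after_chunk):
    """h-step target for an action chunk of length h = len(rewards).

    Sums the discounted rewards collected while the chunk executes,
    then bootstraps once from the value of the state reached afterwards.
    """
    h = len(rewards)
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
    return discounted + gamma ** h * value_after_chunk


def cgq_style_loss(q, td_target, q_chunk, beta):
    """Hypothetical per-sample loss for the single-step critic.

    The first term is the usual single-step TD error. The second term
    pulls the estimate toward the chunk critic's value; a squared
    penalty is an assumption, not the paper's stated form.
    """
    td_error = (q - td_target) ** 2
    guidance = (q - q_chunk) ** 2
    return td_error + beta * guidance


# Toy numbers: a 3-step chunk with reward 1 per step and gamma = 0.5.
target_1 = one_step_target(1.0, 0.5, 4.0)
target_h = chunk_target([1.0, 1.0, 1.0], 0.5, 8.0)
loss = cgq_style_loss(2.0, target_1, target_h, beta=0.5)
```

In this sketch, setting `beta` to zero recovers plain single-step TD learning, while a very large `beta` makes the single-step critic track the chunk critic.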

Paper Structure (57 sections, 11 theorems, 88 equations, 8 figures, 10 tables, 1 algorithm)


Key Result

Theorem 4.1

Let $\mathcal{T}$ be a $\gamma$-Lipschitz linear operator in $L_2$ norm, and $\widehat{\mathcal{T}}$ be a stochastic iterative process defined as above. Then, the asymptotic expected squared error satisfies:

Figures (8)

  • Figure 1: Advantage of Chunk-Guided Q-Learning (CGQ) over single-step and action-chunked TD learning. (Left) Single-step TD learning can suffer from compounding error because it bootstraps from its own value estimates; CGQ mitigates this by guiding the critic toward an action-chunked critic trained with temporally extended backups. (Right) Action-chunked TD learning can be suboptimal (red arrows) because chunked TD backups assume open-loop action sequences, limiting fine-grained stitching across trajectories (gray arrows denote dataset trajectories). In contrast, CGQ can recover the optimal reactive policy (blue arrows) by retaining single-step TD learning.
  • Figure 2: CGQ improves value estimation over single-step and action-chunked TD learning. We compare single-step TD, action-chunked TD, and CGQ in a simple gridworld using a fixed offline dataset. To simulate function approximation error, we add noise to TD targets. The upper-right panel shows the optimal value function. The left panels visualize the value prediction error after $k \in \{1,3,10,100\}$ TD updates and the middle panels show the learned value functions. Single-step TD exhibits large errors in states far from the goal due to bootstrapping error accumulation over long backup chains, while action-chunked TD propagates values quickly but can converge to a suboptimal critic due to open-loop chunks, leaving persistent errors at intermediate states. In contrast, CGQ achieves lower error (see the lower-right plot) by combining rapid chunked value propagation with fine-grained single-step value backups.
  • Figure 3: Performance of various TD update designs for blending single-step and multi-step TD learning. CGQ's regularization yields the most effective critic.
  • Figure 4: Performance of CGQ across different action-chunk sizes $h$. In general, larger chunks improve performance.
  • Figure 5: Performance of CGQ under varying $\beta$. CGQ's performance is sensitive to the choice of $\beta$.
  • ...and 3 more figures
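
The compounding-error argument in the captions for Figures 1 and 2 can be illustrated with a small worked calculation. This is a simplified sketch, not the paper's Theorem 4.1. It assumes a deterministic chain of states, that every backup target is off by at most `eps`, and that a backup of length `step` bootstraps from the state `step` positions closer to the goal. The function name and parameters are hypothetical. Under these assumptions the worst-case error obeys $e_d = \varepsilon + \gamma^{h} e_{d-h}$ with $e_0 = 0$.

```python
def worst_case_error(distance, step, gamma, eps):
    """Worst-case value error at `distance` steps from the goal.

    Assumes each backup adds at most `eps` of error and bootstraps from
    the state `step` positions closer to the goal, whose own error is
    discounted by gamma ** step. `distance` must be a multiple of `step`.
    """
    if distance % step != 0:
        raise ValueError("distance must be a multiple of step")
    err = 0.0  # the goal state's value is assumed to be exact
    for _ in range(distance // step):
        err = eps + gamma ** step * err
    return err


# A state 4 steps from the goal, gamma = 0.5, per-backup error 1.0.
single_step = worst_case_error(4, 1, 0.5, 1.0)  # four chained backups
two_step = worst_case_error(4, 2, 0.5, 1.0)     # two chained backups
four_step = worst_case_error(4, 4, 0.5, 1.0)    # one backup
```

Longer backups shorten the bootstrapping chain, so less error accumulates. The sketch covers only this side of the trade-off; it does not model the suboptimality that open-loop chunks can introduce.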

Theorems & Definitions (20)

  • Theorem 4.1: Error accumulation of one-step TD learning
  • Theorem 4.2: Improved bound via CGQ regularization, informal
  • Theorem 5.1: Bias accumulation of one-step TD learning
  • proof
  • Lemma 5.2
  • proof
  • Lemma 5.3
  • proof
  • Lemma 5.4
  • proof
  • ...and 10 more