On the Complexity of Offline Reinforcement Learning with $Q^\star$-Approximation and Partial Coverage

Haolin Liu, Braham Snyder, Chen-Yu Wei

TL;DR

This work addresses offline RL under $Q^\star$-approximation with partial coverage, showing that $Q^\star$-realizability and Bellman completeness alone do not guarantee sample-efficient learning. It introduces a model- and data-driven decision-estimation coefficient (DEC) framework that decomposes suboptimality into decision complexity and estimation error, and develops a second-order performance-difference lemma that yields $1/\varepsilon^2$ sample complexity for regularized offline RL. The paper analyzes DEC-based objectives (E2D.OR) and contrasts them with greedy, value-centric approaches (GDE), establishing gap-adaptive guarantees and drawing practical implications for algorithms like CQL. It further characterizes offline learnability under low-Bellman-rank MDPs, highlights the necessity of double policy sampling and policy feature coverage, and gives the first analysis of CQL beyond the tabular setting under $Q^\star$-realizability and Bellman completeness. Overall, the DEC framework offers a modular, broadly applicable lens that connects offline pessimism with principled decision-making.

Abstract

We study offline reinforcement learning under $Q^\star$-approximation and partial coverage, a setting that motivates practical algorithms such as Conservative $Q$-Learning (CQL; Kumar et al., 2020) but has received limited theoretical attention. Our work is inspired by the following open question: "Are $Q^\star$-realizability and Bellman completeness sufficient for sample-efficient offline RL under partial coverage?" We answer in the negative by establishing an information-theoretic lower bound. Going substantially beyond this, we introduce a general framework that characterizes the intrinsic complexity of a given $Q^\star$ function class, inspired by model-free decision-estimation coefficients (DEC) for online RL (Foster et al., 2023b; Liu et al., 2025b). This complexity recovers and improves the quantities underlying the guarantees of Chen and Jiang (2022) and Uehara et al. (2023), and extends to broader settings. Our decision-estimation decomposition can be combined with a wide range of $Q^\star$ estimation procedures, modularizing and generalizing existing approaches. Beyond the general framework, we make further contributions: By developing a novel second-order performance difference lemma, we obtain the first $\varepsilon^{-2}$ sample complexity under partial coverage for soft $Q$-learning, improving the $\varepsilon^{-4}$ bound of Uehara et al. (2023). We remove Chen and Jiang's (2022) need for additional online interaction when the value gap of $Q^\star$ is unknown. We also give the first characterization of offline learnability for general low-Bellman-rank MDPs without Bellman completeness (Jiang et al., 2017; Du et al., 2021; Jin et al., 2021), a canonical setting in online RL that remains unexplored in offline RL except for special cases. Finally, we provide the first analysis for CQL under $Q^\star$-realizability and Bellman completeness beyond the tabular case.
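
For readers unfamiliar with the soft $Q$-learning objective mentioned in the abstract, the following is a minimal sketch of the standard entropy-regularized (soft) Bellman backup in a small tabular, finite-horizon MDP with known dynamics. It is not the paper's algorithm and involves no offline data or pessimism; the function names, the stationary transition table `P`, the reward table `R`, and the temperature `alpha` are all illustrative assumptions.

```python
import math

def soft_value(q_row, alpha):
    """Soft value alpha * log(sum_a exp(q_a / alpha)), computed stably."""
    m = max(q_row)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_row))

def soft_policy(q_row, alpha):
    """Softmax policy pi(a) proportional to exp(q_a / alpha)."""
    m = max(q_row)
    weights = [math.exp((q - m) / alpha) for q in q_row]
    total = sum(weights)
    return [w / total for w in weights]

def soft_backward_induction(P, R, H, alpha):
    """Finite-horizon soft Q backup (hypothetical helper, known model assumed).

    P[s][a] is a list of next-state probabilities, R[s][a] is a reward,
    both assumed stationary across steps. Returns a list Q of length H,
    where Q[h][s][a] is the soft Q-value at step h (0-indexed).
    """
    num_states, num_actions = len(R), len(R[0])
    next_v = [0.0] * num_states  # value after the final step is zero
    Q = [None] * H
    for h in reversed(range(H)):
        Q[h] = [
            [
                R[s][a] + sum(p * v for p, v in zip(P[s][a], next_v))
                for a in range(num_actions)
            ]
            for s in range(num_states)
        ]
        next_v = [soft_value(Q[h][s], alpha) for s in range(num_states)]
    return Q

# Toy usage: one state, two actions with rewards 1 and 0, horizon 2.
P = [[[1.0], [1.0]]]
R = [[1.0, 0.0]]
Q = soft_backward_induction(P, R, H=2, alpha=1.0)
pi0 = soft_policy(Q[0][0], alpha=1.0)
```

As `alpha` shrinks, `soft_value` approaches the ordinary maximum and `soft_policy` approaches the greedy policy, which is why the regularized problem is often viewed as a smoothed version of standard $Q$-learning.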

Paper Structure (49 sections, 39 theorems, 195 equations, 5 figures, 3 algorithms)

Key Result

Theorem 1

There exists a family of MDPs $\mathcal{M}$, a function class $\mathcal{F}$ with $|\mathcal{F}|=4$, and an offline data distribution $\mu$ such that under any true model $M^\star\in \mathcal{M}$, $Q^\star$-realizability and Bellman completeness hold (with $\mathcal{G}=\mathcal{F}$), and has coverage

Figures (5)

  • Figure 1: Comparison of DEC objectives in online and offline settings for a given discrepancy measure $D$.
  • Figure 2: Four classes of MDPs: $\mathcal{M}_{u,x}$, $\mathcal{M}_{u,y}$, $\mathcal{M}_{v,x}$, and $\mathcal{M}_{v,y}$. Red text highlights the differences between classes, with each class named according to the action pair $(u/v, x/y)$ in the top branch.
  • Figure 3: Construction for $\epsilon$-dependent lower bound with non-trajectory data
  • Figure 4: Construction for $\epsilon$-dependent lower bound with trajectory data (extended from Figure 3). The blue arrows indicate the transition of action 1, which always leads to a uniform distribution over the same group on the next layer. The red arrows indicate the transition of action 2, which always leads to a uniform distribution over all states on the next layer.
  • Figure 5: The construction of Jia et al. (2024). The numbers to the left of the green arrows specify the initial state distribution. There are two actions. Taking either action in a state in $X_h\cup Y_h \cup\{z_h\}$ leads to transitions to $u_{h+1}$ and $v_{h+1}$ with probabilities $P(u_{h+1}|s,a) + P(v_{h+1}|s,a) = \frac{2}{H}$, which are not shown in the figure. Blue arrows specify the remaining transitions (those other than to $u_{h+1}$ and $v_{h+1}$) under action $1$, and red arrows those under action $2$. The numbers on the red arrows are the transition probabilities. Blue and red arrows without a number have transition probability $\frac{H-2}{H}$. Every transition to group $X_h$ or $Y_h$ results in a uniform distribution over that group. On $u_h$, $v_h$, and $w_h$, taking any action leads to a deterministic transition to $w_{h+1}$.

Theorems & Definitions (84)

  • Definition 1: Coverage
  • Definition 2: $Q^\star$-realizability
  • Theorem 1
  • Example 1
  • Theorem 2
  • Theorem 3
  • Definition 3: Exploitability Ratio
  • Theorem 4
  • Theorem 5
  • Example 2
  • ...and 74 more