What are the Statistical Limits of Offline RL with Linear Function Approximation?

Ruosong Wang, Dean P. Foster, Sham M. Kakade

TL;DR

This work establishes a fundamental limit for offline reinforcement learning with linear function approximation: even when the Q-functions of all policies are linear in a given feature map (realizability) and the data has good spectral coverage, any algorithm requires a number of samples exponential in the horizon to non-trivially estimate policy values. The authors prove an information-theoretic lower bound via a carefully constructed hard instance, and they show that estimation error can be amplified geometrically from one step to the next, so naive LSPE/LSVI approaches are not sample-efficient. They also analyze upper bounds under two favorable conditions (low distribution shift and policy completeness), showing that sample-efficient evaluation becomes possible under these stronger assumptions. Overall, the paper clarifies that sample-efficient offline policy evaluation is unattainable without either restricting distribution shift or imposing stronger representational conditions beyond realizability, which can guide future theoretical and algorithmic work in offline RL.

Abstract

Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies. The hope is that offline reinforcement learning coupled with function approximation methods (to deal with the curse of dimensionality) can provide a means to help alleviate the excessive sample complexity burden in modern sequential decision making problems. However, the extent to which this broader approach can be effective is not well understood, where the literature largely consists of sufficient conditions. This work focuses on the basic question of what are necessary representational and distributional conditions that permit provable sample-efficient offline reinforcement learning. Perhaps surprisingly, our main result shows that even if: i) we have realizability in that the true value function of every policy is linear in a given set of features and ii) our off-policy data has good coverage over all features (under a strong spectral condition), then any algorithm still (information-theoretically) requires a number of offline samples that is exponential in the problem horizon in order to non-trivially estimate the value of any given policy. Our results highlight that sample-efficient offline policy evaluation is simply not possible unless significantly stronger conditions hold; such conditions include either having low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).
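The spectral coverage condition mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "good coverage" is read as a lower bound on the smallest eigenvalue of the feature second-moment matrix under the data distribution; the function name `min_feature_eigenvalue` and the toy datasets are hypothetical and do not come from the paper.

```python
import numpy as np

def min_feature_eigenvalue(features):
    """Smallest eigenvalue of the empirical second-moment matrix (1/n) * sum_i phi_i phi_i^T.

    `features` is an (n, d) array whose rows are feature vectors phi(s, a)
    drawn from the offline data distribution (an assumed data layout).
    """
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    second_moment = features.T @ features / n
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    return float(np.linalg.eigvalsh(second_moment)[0])

d = 4
# Toy dataset A: every standard basis direction is visited equally often,
# so the second-moment matrix is I / d and its smallest eigenvalue is 1 / d.
covered = np.tile(np.eye(d), (100, 1))
lam_covered = min_feature_eigenvalue(covered)

# Toy dataset B: the last direction is never visited, so coverage fails
# and the smallest eigenvalue is 0.
uncovered = np.tile(np.eye(d)[: d - 1], (100, 1))
lam_uncovered = min_feature_eigenvalue(uncovered)

print(lam_covered, lam_uncovered)
```

In this sketch dataset A satisfies the coverage condition with eigenvalue 1/d, while dataset B does not. The abstract's point is that even data of the first kind does not by itself guarantee sample-efficient evaluation.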

Paper Structure

This paper contains 35 sections, 7 theorems, 54 equations, 2 figures, 1 algorithm.

Key Result

Theorem 1.1

In the offline RL setting, suppose the data distributions have (polynomially) lower bounded eigenvalues, and the $Q$-functions of every policy are linear with respect to a given feature mapping. Any algorithm requires an exponential number of samples in the horizon $H$ to output a non-trivially accu

Figures (2)

  • Figure 1: An illustration of the hard instance. Recall that $\hat{d} = d/2$. States on the top are those in the first level ($h = 1$), while states at the bottom are those in the last level ($h = H$). Solid line (with arrow) corresponds to transitions associated with action $a_1$, while dotted line (with arrow) corresponds to transitions associated with action $a_2$. For each level $h \in [H]$, reward values and $Q$-values associated with $s_h^1, s_h^2, \ldots, s_h^{\hat{d}}$ are marked on the left, while reward values and $Q$-values associated with $s_h^{\hat{d} + 1}$ are marked on the right. Rewards and transitions are all deterministic, except for the reward distributions associated with $s_H^1, s_H^2, \ldots, s_H^{\hat{d}}$. We mark the expectation of the reward value when it is stochastic. For each level $h \in [H]$, for the data distribution $\mu_h$, the state is chosen uniformly at random from those states in the dashed rectangle, i.e., $\{s_h^1, s_h^2, \ldots, s_h^{\hat{d}}\}$, while the action is chosen uniformly at random from $\{a_1, a_2\}$. Suppose the initial state is $s_1^{\hat{d} + 1}$. When $r_0 = 0$, the value of the policy is $0$. When $r_0 = {\hat{d}}^{-H/2}$, the value of the policy is $r_0 \cdot {\hat{d}}^{H / 2} = 1$.
  • Figure 2: An illustration of the hard instance. Recall that $\hat{d} = d / 2 - 1$. States on the top are those in the first level ($h = 1$), while states at the bottom are those in the last level ($h = H$). Dotted line (with arrow) corresponds to transitions associated with actions $a_1, a_2, \ldots, a_{\hat{d}}$, while solid line (with arrow) corresponds to transitions associated with actions $a_{\hat{d} + 1}, a_{\hat{d} + 2}, \ldots, a_d$. We omit the transition associated with $a_1, a_2, \ldots, a_{\hat{d}}$ in the figure if all actions give the same transition. For each level $h \in [H]$, $Q$-values associated with $s_h^1, s_h^2, \ldots, s_h^{\hat{d}}, s_h^+, s_h^-$ are marked on the left, while transition distributions and $Q$-values associated with $s_h^{\hat{d} + 1}$ are marked on the right. Rewards are all deterministic, and the only two states ($s_H^+$ and $s_H^-$) with non-zero reward values are marked in black and grey. Consider the fixed policy that returns $a_d$ for all input states. When $r_0 = 0$, the value of the policy is $0$. When $r_0 = {\hat{d}}^{-(H-2)/2}$, the value of the policy is $r_0 \cdot {\hat{d}}^{(H - 2)/2} = 1$.
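
The arithmetic in the Figure 1 caption can be checked with a short sketch. It uses only the two quantities stated there, $r_0 = {\hat{d}}^{-H/2}$ and the policy value $r_0 \cdot {\hat{d}}^{H/2}$. The function name `hard_instance_signal` and the choice $\hat{d} = 4$ are illustrative assumptions. The final column, $1 / r_0^2$, is a common rule of thumb for how many noisy samples it takes to tell a mean of $r_0$ from a mean of $0$; it is included only for intuition and is not the paper's stated bound.

```python
def hard_instance_signal(d_hat, horizon):
    """Return (r0, value) for the Figure 1 instance.

    r0 is the expected last-level reward d_hat ** (-H / 2); the value at the
    initial state is r0 * d_hat ** (H / 2), as stated in the caption.
    """
    r0 = d_hat ** (-horizon / 2)
    value = r0 * d_hat ** (horizon / 2)
    return r0, value

d_hat = 4  # illustrative choice, not taken from the paper
rows = []
for horizon in (2, 4, 8, 16):
    r0, value = hard_instance_signal(d_hat, horizon)
    # Rule-of-thumb sample count (an assumption for intuition only).
    rough_samples = 1.0 / r0 ** 2
    rows.append((horizon, r0, value, rough_samples))
    print(f"H={horizon:2d}  r0={r0:.3e}  value={value:.1f}  1/r0^2={rough_samples:.3e}")
```

The value at the initial state stays at 1 for every horizon, while the reward signal visible in the last level shrinks geometrically in $H$. This matches the amplification described in the caption.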

Theorems & Definitions (14)

  • Theorem 1.1: Informal
  • Theorem 4.1
  • Remark 1: The sparse reward case
  • Remark 2: Least-Squares Policy Evaluation (LSPE) has exponential variance
  • Remark 3: Least-Squares Value Iteration (LSVI) versus Least-Squares Policy Iteration (LSPI)
  • Lemma 4.2
  • proof
  • Remark 4
  • Lemma 5.1
  • Remark 5
  • ...and 4 more