The Role of Environment Access in Agnostic Reinforcement Learning

Akshay Krishnamurthy, Gene Li, Ayush Sekhari

TL;DR

This work studies agnostic policy learning in reinforcement learning with large state spaces, asking how much environment access, and of what kind, is needed for sample-efficient guarantees without assuming realizability. The authors prove information-theoretic lower bounds showing that polynomial-sample learning is impossible in general under strong access models (generative/local simulators and μ-resets), even when the policy class is realizable or favorable coverage properties hold. They also prove a positive result for Block MDPs under a hybrid resets model, which grants both forms of reset access. The key tool is a policy emulator, paired with the PLHR algorithm: a compact tabular MDP, constructed via resets, that approximates the value of every policy in Π. Together, these results expose a fundamental trade-off: learning under extremely weak function approximation assumptions can become tractable given sufficiently strong environment access, and the policy emulator is a new tool for policy evaluation in such settings.

Abstract

We study Reinforcement Learning (RL) in environments with large state spaces, where function approximation is required for sample-efficient learning. Departing from a long history of prior work, we consider the weakest possible form of function approximation, called agnostic policy learning, where the learner seeks to find the best policy in a given class $\Pi$, with no guarantee that $\Pi$ contains an optimal policy for the underlying task. Although it is known that sample-efficient agnostic policy learning is not possible in the standard online RL setting without further assumptions, we investigate the extent to which this can be overcome with stronger forms of access to the environment. Specifically, we show that:

1. Agnostic policy learning remains statistically intractable when given access to a local simulator, from which one can reset to any previously seen state. This result holds even when the policy class is realizable, and stands in contrast to a positive result of [MFR24] showing that value-based learning under realizability is tractable with local simulator access.
2. Agnostic policy learning remains statistically intractable when given online access to a reset distribution with good coverage properties over the state space (the so-called $\mu$-reset setting). We also study stronger forms of function approximation for policy learning, showing that PSDP [BKSN03] and CPI [KL02] provably fail in the absence of policy completeness.
3. On a positive note, agnostic policy learning is statistically tractable for Block MDPs with access to both of the above reset models. We establish this via a new algorithm that carefully constructs a policy emulator: a tabular MDP with a small state space that approximates the value functions of all policies $\pi \in \Pi$. These values are approximated without any explicit value function class.
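
To make the idea of a policy emulator more concrete, here is a minimal sketch of how such an object could be used once it exists: each policy in the class is evaluated by backward induction on a small tabular MDP, and the best one is returned. The arrays `P`, `R`, and `d1`, the helper names, and the two-policy class are hypothetical toy values chosen for illustration. This is not the construction from the paper, and it says nothing about how the emulator is built from resets.

```python
import numpy as np

def evaluate_policy(P, R, d1, policy):
    """Value of a deterministic, layered policy on a finite-horizon tabular MDP.

    P[h, s, a, s2] : transition probabilities at layer h (hypothetical emulator)
    R[h, s, a]     : expected reward at layer h
    d1[s]          : initial state distribution
    policy[h, s]   : action taken at layer h in state s
    """
    H, S = P.shape[0], P.shape[1]
    idx = np.arange(S)
    V = np.zeros(S)  # value after the last layer is zero
    for h in reversed(range(H)):
        a = policy[h]  # one action per state, shape (S,)
        # Bellman backup for this fixed policy: reward plus expected next value.
        V = R[h, idx, a] + P[h, idx, a] @ V
    return float(d1 @ V)

def best_in_class(P, R, d1, policy_class):
    """Evaluate every policy on the tabular MDP and return the argmax index."""
    values = [evaluate_policy(P, R, d1, pi) for pi in policy_class]
    return int(np.argmax(values)), values

# Toy instance (assumed for illustration): 2 layers, 2 states, 2 actions.
H, S, A = 2, 2, 2
P = np.zeros((H, S, A, S))
P[:, :, 0, 0] = 1.0  # action 0 always leads to state 0
P[:, :, 1, 1] = 1.0  # action 1 always leads to state 1
R = np.zeros((H, S, A))
R[1, 1, 1] = 1.0     # only reward: action 1 in state 1 at the last layer
d1 = np.array([1.0, 0.0])

always_0 = np.zeros((H, S), dtype=int)
always_1 = np.ones((H, S), dtype=int)
best, values = best_in_class(P, R, d1, [always_0, always_1])
print(best, values)
```

Note that no value function class appears anywhere: the values come from dynamic programming on the tabular model, which mirrors the abstract's remark that values are approximated without an explicit value function class.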

Paper Structure

This paper contains 112 sections, 50 theorems, 277 equations, 7 figures, 1 table, 7 algorithms.

Key Result

Theorem 1

Suppose the policy class $\Pi$ satisfies policy completeness (Definition 2), and the reset distribution $\mu$ satisfies concentrability with parameter $C_\mathsf{conc}$. With probability $1 - \delta$, PSDP finds an $\varepsilon$-optimal policy using $\mathrm{poly}(C_\mathsf{conc}, A, H, \dots)$
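
Theorem 1 concerns PSDP [BKSN03], which fits one layer at a time, starting from the last: at layer $h$ it resets to states drawn from $\mu_h$ and picks the policy whose action at layer $h$, followed by the policies already chosen for later layers, has the highest estimated return. The following is a rough Monte Carlo sketch of that loop, not necessarily the exact variant analyzed in the paper. The simulator interface `step`, the sampler `mu_sample`, and the toy environment are all hypothetical.

```python
import numpy as np

def psdp(step, mu_sample, policy_class, H, n_rollouts, rng):
    """Policy Search by Dynamic Programming, Monte Carlo sketch.

    step(t, x, a, rng) -> (reward, next_x)   hypothetical simulator interface
    mu_sample(h, rng)  -> x                  draws a state from the reset distribution at layer h
    policy_class       : list of functions pi(t, x) -> action
    Returns `chosen`, where chosen[h] is the policy used at layer h.
    """
    chosen = [None] * H
    for h in reversed(range(H)):  # last layer first
        best_value, best_pi = -np.inf, None
        for pi in policy_class:
            total = 0.0
            for _ in range(n_rollouts):
                x = mu_sample(h, rng)  # reset to x ~ mu_h
                # Play the candidate at layer h, then the already-fixed later policies.
                for t in range(h, H):
                    actor = pi if t == h else chosen[t]
                    r, x = step(t, x, actor(t, x), rng)
                    total += r
            value = total / n_rollouts  # empirical return from layer h onward
            if value > best_value:
                best_value, best_pi = value, pi
        chosen[h] = best_pi
    return chosen

# Toy environment (assumed for illustration): 2 layers, states {0, 1}.
H = 2

def step(t, x, a, rng):
    # The action chosen becomes the next state; the only reward is for
    # taking action 1 in state 1 at the last layer.
    reward = 1.0 if (t == H - 1 and x == 1 and a == 1) else 0.0
    return reward, a

def mu_sample(h, rng):
    return int(rng.integers(2))  # uniform reset distribution over {0, 1}

def always_0(t, x):
    return 0

def always_1(t, x):
    return 1

rng = np.random.default_rng(0)
chosen = psdp(step, mu_sample, [always_0, always_1], H, n_rollouts=200, rng=rng)
print([pi.__name__ for pi in chosen])
```

In this toy instance the class contains a policy that is best at every layer regardless of what follows, so the greedy layer-by-layer choice is harmless. The lower bounds discussed in this paper concern what can go wrong when policy completeness does not hold.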

Figures (7)

  • Figure 1: Left. Summary of results for policy learning under various forms of access to the MDP. A ✓ indicates there exists an algorithm that adapts to coverage conditions, while ✗ indicates a lower bound showing impossibility. Remarks: For realizability + $\mu$-resets (?), we establish sample-inefficiency for PSDP and CPI (sec:policy-completeness), but impossibility remains open. Two settings are omitted: in online RL, adapting to coverability is impossible (implied by thm:lower-bound-coverability); in offline RL, adapting to concentrability of the offline distribution is impossible jia2024offline. Right. Relationships between interaction models. An arrow $A \boldsymbol{\rightarrow} B$ implies that interaction model $B$ can be simulated using interaction model $A$.
  • Figure 1: Lower bound for PSDP without policy completeness. Red arrows represent action $0$ and blue arrows represent action $1$. Purple denotes the expectation of the stochastic reward. Let $\gamma > 0$ be an arbitrarily small constant. At layer $h=2$, with constant probability, PSDP selects $\textcolor{red}{{\widehat{\pi}}^{(2)} \gets 0}$ since $\mathbb{E}_{x\sim \mu_2} V^{\pi_0}(x) = 1/2$ and $\mathbb{E}_{x\sim \mu_2} V^{\pi_1}(x) = 1/2 + \gamma$. Conditioned on ${\widehat{\pi}}^{(2)} = 0$, we have $\mathbb{E}_{x \sim \mu_1} V^{\pi_0 \circ {\widehat{\pi}}^{(2)}}(x) = 3/4$ while $\mathbb{E}_{x \sim \mu_1} V^{\pi_1 \circ {\widehat{\pi}}^{(2)}}(x) = 1/4$, so PSDP selects $\textcolor{red}{{\widehat{\pi}}^{(1)} \gets 0}$. The returned policy ${\widehat{\pi}}^{(1)} \circ {\widehat{\pi}}^{(2)}$ is $(1+\gamma)$-suboptimal on $d_1$. Note that $\mu = \{\mu_1, \mu_2\}$ satisfies $C_\mathsf{conc} = 4$, and that $\Pi$ satisfies realizability.
  • Figure 2: Construction used for proof of thm:lower-bound-coverability.
  • Figure 3: Construction used for thm:lower-bound-policy-completeness.
  • Figure 4: Illustration of how certifying accuracy of test policies prevents error amplification. Suppose we want to learn the transition $P_\mathsf{lat}(s_{h-1}, a_{h-1}) = s_h$. In $M_\mathsf{lat}$, all policies get value 0 from both $s_h$ and $\bar{s}_h$, with the exception of a special $\color{Purple}{\widetilde{\pi}}$ that gets value $2\Gamma_h$ from $s_h$; in $\widehat{M}_\mathsf{lat}$ all policies get value $\Gamma_h$ from $s_h$ and value 0 from $\bar{s}_h$. Thus, $\widehat{M}_\mathsf{lat}$ satisfies $(A)$ but any test policy $\pi_{s_h, \bar{s}_h} \in \Pi$ will not satisfy (C). It is unlikely that $\pi_{s_h, \bar{s}_h} = \color{Purple}{\widetilde{\pi}}$ is selected, and if we execute any other $\pi$ from the true transition $s_h$, we will observe value $0$, and thus decode the transition to $\widehat{P}_\mathsf{lat}(s_{h-1}, a_{h-1}) = \bar{s}_h$. Therefore, $\lvert Q^{\pi}(s_{h-1}, a_{h-1}) - \widehat{Q}^\pi(s_{h-1}, a_{h-1})\rvert = 2\Gamma_{h}$, thus doubling the policy evaluation error from layer $h$ to $h-1$. Unchecked, this could cause exponential (in $H$) error amplification. Certifying test policy accuracy prevents this, as Refit.D would detect the violation $\lvert V^\pi(s_{h}) -\widehat{V}^\pi(s_{h})\rvert = \Gamma_{h} \gg \epsilon_\mathsf{tol}$ for any $\pi \in \Pi$ and refit $\widehat{M}_\mathsf{lat}$ instead.
  • ...and 2 more figures
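
The error-amplification argument in the Figure 4 caption can be written as a one-line recursion. Here $e_h$ is a hypothetical shorthand (not the paper's notation) for the worst-case policy evaluation error at layer $h$, and the sketch assumes the doubling described in the caption occurs at every layer:

```latex
e_{h-1} = 2\, e_h \quad \text{for } h = 2, \dots, H
\qquad \Longrightarrow \qquad
e_1 = 2^{H-1}\, e_H .
```

Under that assumption, an error of $e_H$ at the last layer grows to $2^{H-1} e_H$ at the first, which is the exponential (in $H$) amplification that certifying test policy accuracy is meant to prevent.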

Theorems & Definitions (110)

  • Definition 1: Concentrability
  • Definition 2: Policy Completeness
  • Theorem 1
  • Definition 3: Coverability [xie2022role]
  • Definition 4: Spanning Capacity [jia2023agnostic]
  • Theorem 2
  • Theorem 3
  • Definition 5: Pushforward Concentrability
  • Theorem 4
  • Theorem 5
  • ...and 100 more
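
Concentrability (Definition 1) is only named in this listing. As a small illustration, the sketch below computes the standard density-ratio form of the coefficient, $\sup_{\pi \in \Pi,\, h} \lVert d_h^\pi / \mu_h \rVert_\infty$, for tabular occupancy measures. That this form matches the paper's Definition 1 exactly is an assumption, and the function name and toy numbers are hypothetical.

```python
import numpy as np

def concentrability(occupancies, mu):
    """Largest ratio d_h^pi(x) / mu_h(x) over policies, layers, and states.

    occupancies : shape (num_policies, H, S); occupancies[i, h] is the state
                  distribution of policy i at layer h
    mu          : shape (H, S); the reset distribution at each layer
    """
    occupancies = np.asarray(occupancies, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # A state reached by some policy but given zero mass by mu makes the ratio infinite.
    if np.any((mu == 0) & (occupancies > 0).any(axis=0)):
        return float("inf")
    with np.errstate(divide="ignore", invalid="ignore"):
        # States a policy never visits contribute ratio 0 (this also covers 0/0).
        ratios = np.where(occupancies > 0, occupancies / mu, 0.0)
    return float(ratios.max())

# Toy example (assumed): one layer, two states, two deterministic policies
# that each sit on a different state, and a uniform reset distribution.
occupancies = [[[1.0, 0.0]], [[0.0, 1.0]]]
mu = [[0.5, 0.5]]
c_conc = concentrability(occupancies, mu)
print(c_conc)
```

With a uniform reset distribution over two states, each point-mass occupancy has density ratio $1 / 0.5 = 2$, so the coefficient is 2 in this toy case.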