Offline Reinforcement Learning: Role of State Aggregation and Trajectory Data

Zeyu Jia, Alexander Rakhlin, Ayush Sekhari, Chen-Yu Wei

TL;DR

The paper investigates offline policy evaluation (OPE) under value-function realizability without Bellman completeness and shows that even with bounded concentrability and trajectory data, polynomial sample complexity cannot be guaranteed in general. It identifies the concentrability coefficient of an aggregated Markov transition model as the quantity governing sample complexity, and proves that this coefficient can grow exponentially with the horizon, even when the original MDP has small concentrability and the data is admissible. A general reduction demonstrates that trajectory data offers no extra benefit over admissible data in the worst case for OPE, and two protocols (Replicator and admissible-to-trajectory) underpin this hardness result. On the positive side, the authors provide a BVFT-based upper bound whose sample complexity scales with the aggregated concentrability, matching the lower bound in its dependence on that quantity. Altogether, the work highlights fundamental barriers to tractable offline RL under realizability alone and motivates future exploration of structural assumptions or algorithmic innovations to exploit trajectory data or relax key conditions.

Abstract

We revisit the problem of offline reinforcement learning with value function realizability but without Bellman completeness. Previous work by Xie and Jiang (2021) and Foster et al. (2022) left open the question of whether a bounded concentrability coefficient along with trajectory-based offline data admits a polynomial sample complexity. In this work, we provide a negative answer to this question for the task of offline policy evaluation. In addition to addressing this question, we provide a rather complete picture for offline policy evaluation with only value function realizability. Our primary findings are threefold: 1) The sample complexity of offline policy evaluation is governed by the concentrability coefficient in an aggregated Markov Transition Model jointly determined by the function class and the offline data distribution, rather than that in the original MDP. This unifies and generalizes the ideas of Xie and Jiang (2021) and Foster et al. (2022). 2) The concentrability coefficient in the aggregated Markov Transition Model may grow exponentially with the horizon length, even when the concentrability coefficient in the original MDP is small and the offline data is admissible (i.e., the data distribution equals the occupancy measure of some policy). 3) Under value function realizability, there is a generic reduction that can convert any hard instance with admissible data to a hard instance with trajectory data, implying that trajectory data offers no extra benefits over admissible data. These three pieces jointly resolve the open problem, though each of them could be of independent interest.
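
The gap between the two coefficients in findings 1) and 2) can be seen on a toy chain. The sketch below is only an illustration and is not the paper's Definition 1, which involves an $\varepsilon$ parameter and a maximum over sets of aggregations. It assumes a simplified aggregated model in which transitions between aggregations are averaged with weights given by the data distribution $\mu$, takes $\mu$ to be the same at every layer, and uses the plain ratio $\max d_h / \mu$ as the coefficient. The function names (`occupancy`, `concentrability`, `aggregate`) and the three-state example are hypothetical.

```python
from fractions import Fraction


def occupancy(init, trans, horizon):
    """Per-layer state distributions d_1, ..., d_H under a fixed policy."""
    layers = [dict(init)]
    for _ in range(horizon - 1):
        nxt = {}
        for s, p in layers[-1].items():
            for s2, q in trans[s].items():
                nxt[s2] = nxt.get(s2, 0) + p * q
        layers.append(nxt)
    return layers


def concentrability(layers, mu):
    """Largest ratio d_h(s) / mu(s) over all layers and visited states."""
    return max(p / mu[s] for layer in layers for s, p in layer.items() if p > 0)


def aggregate(init, trans, mu, phi):
    """Merge states by block label phi[s].

    Assumed form: block-to-block transitions are the mu-weighted
    average of the transitions of the states inside the block.
    """
    mu_bar, init_bar = {}, {}
    for s, m in mu.items():
        mu_bar[phi[s]] = mu_bar.get(phi[s], 0) + m
    for s, p in init.items():
        init_bar[phi[s]] = init_bar.get(phi[s], 0) + p
    trans_bar = {b: {} for b in mu_bar}
    for s, row in trans.items():
        w = mu[s] / mu_bar[phi[s]]  # weight of s inside its block
        for s2, q in row.items():
            b2 = phi[s2]
            trans_bar[phi[s]][b2] = trans_bar[phi[s]].get(b2, 0) + w * q
    return init_bar, trans_bar, mu_bar


# Hypothetical example: dynamics of the evaluation policy on states a, b, c.
H = 4
init = {"a": Fraction(1)}
trans = {"a": {"a": 1}, "b": {"c": 1}, "c": {"a": 1}}
mu = {"a": Fraction(1, 2), "b": Fraction(9, 20), "c": Fraction(1, 20)}
phi = {"a": "A", "b": "A", "c": "C"}  # a and b share one aggregation

c_orig = concentrability(occupancy(init, trans, H), mu)
init_bar, trans_bar, mu_bar = aggregate(init, trans, mu, phi)
c_agg = concentrability(occupancy(init_bar, trans_bar, H), mu_bar)
print(c_orig, c_agg)  # 2 180/19
```

In the original model the evaluation policy stays on `a`, which the data covers well, so the ratio is 2. Merging `a` with `b` lets the aggregated chain move mass into block `C`, which the data barely covers, and the ratio rises to 180/19 (about 9.5).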

Paper Structure (66 sections, 46 theorems, 232 equations, 6 figures, 4 algorithms)


Key Result

Theorem 1

Let $\varepsilon \in (0, 1)$, $M$ be a Markov Transition Model, $\Phi$ be an aggregation scheme over the states of $M$, ${\pi_{\mathrm{e}}}$ be a deterministic evaluation policy in $M$ such that for any aggregation $\phi \in \Phi$ and states $x, x' \in \phi$ it holds that ${\pi_{\mathrm{e}}}(x) = {\pi_{\mathrm{e}}}(x')$ ... Furthermore, any offline policy evaluation algorithm that guarantees to estimate the value of ${\pi_{\mathrm{e}}}$ ...

Figures (6)

  • Figure 1: Markov Transition Model and aggregation scheme in eg:arbitrary_behavior. The blue arrows represent the transitions under action $\mathfrak{a}_1$, and the red arrows represent the transitions under $\mathfrak{a}_2$. The green arrows denote the initial distribution $\rho$.
  • Figure 2: Dynamics for policy ${\pi_{\mathrm{e}}}$ in the aggregated MDP $\bar{M}$ in eg:arbitrary_behavior, where $p_h := \bar{T}(\phi_{h+1}^{[1]} \mid \phi_{h}^{[1]}, {\pi_{\mathrm{e}}}; \bar{M})$. As shown in lem:sec-admissible-mu-2, $p_h\ge \frac{H-1}{H+2}$ for all $1\le h\le H-1$. The green arrows denote the initial distribution $\rho$ in the aggregated MDP.
  • Figure 3: Lower bound construction for the proof of thm:realizable-H. For each $h \in [H]$, the corresponding block denotes the $K$ layers obtained using Replicator by replicating the $h$-th layer of the given MDP $M$ $K$ times. The solid red arrows represent the transitions under the action $\mathfrak{a}_1$, the dotted blue arrows represent the transitions under the action $\mathfrak{a}_2$ (under which we resample from the admissible distribution $\mu_h$), and the solid green arrows denote the transitions according to the original MDP $M$.
  • Figure 4: Lower bound construction used in the proof sketch of thm: informal general data. States are represented with circles and the corresponding state aggregations are represented with rectangles. We use the bold rectangle to denote the set of aggregations ${\mathcal{I}}$ that attains the maximum in the definition of $\bar{\mathsf{C}}_\varepsilon(M,\Phi,\mu)$ (see def:agg-concentrability). For simplicity, in the above figure $\mathcal{I}$ only contains a single aggregation that contains a single latent state ${z}^\star$, while in general $\mathcal{I}$ may include multiple aggregations each with multiple latent states.
  • Figure 5: Lower bound construction in thm:realizable-H. The blue arrows represent the transitions under the action $\mathfrak{a}_1$, and the red arrows represent the transitions under the action $\mathfrak{a}_2$. In the middle layer, the arrows to the blocks $\mathcal{X}_2^1$ and $\mathcal{X}_2^2$ denote uniform transitions to the states within those blocks.
  • ...and 1 more figure

Theorems & Definitions (61)

  • Definition 1: Aggregated Concentrability Coefficient
  • Theorem 1
  • Example 1
  • Proposition 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Definition 2: Pushforward Concentrability Coefficient and Action Concentrability Coefficient (Xie and Jiang, 2021)
  • Definition 3: Block MDP (Du et al., 2019; Misra et al., 2020)
  • Definition 4: $W$-function of OPE problems
  • ...and 51 more