Trajectory Data Suffices for Statistically Efficient Policy Evaluation in Finite-Horizon Offline RL with Linear $q^π$-Realizability and Concentrability

Volodymyr Tkachuk, Csaba Szepesvári, Xiaoqi Tan

TL;DR

This work studies statistical efficiency in finite-horizon offline RL with function approximation under concentrability (data coverage) and linear $q^\pi$-realizability. Using trajectory data and a linear $q^\pi$ representation, it constructs a statistically efficient learner for policy evaluation (Lin-$q^\pi$-FQE) and tightens the analysis of the learner for policy optimization (Lin-$q^\pi$-FQI) via skippy Bellman operators. The main results are twofold. First, when $q^\pi$ is linearly realizable for all memoryless policies, policy evaluation is possible with polynomially many trajectories, $n = \tilde{\Theta}(C_0^5 H^7 d^3/\epsilon^2 + L_\phi^2)$. Second, a refined analysis improves the policy-optimization bound by a factor of $C_0 d$. The paper also formalizes the construction of the data-dependent modification sets $\mathbf{G}_{\text{opt}}$ and $\mathbf{G}_{\text{eval}}$ and shows how trajectory structure can be exploited to obtain Bellman completeness in a modified MDP. Together, the results clarify when offline RL with function approximation can be statistically efficient and give learners whose sample complexity can be analyzed under these assumptions.

Abstract

We study finite-horizon offline reinforcement learning (RL) with function approximation for both policy evaluation and policy optimization. Prior work established that statistically efficient learning is impossible for either of these problems when the only assumptions are that the data has good coverage (concentrability) and the state-action value function of every policy is linearly realizable ($q^π$-realizability) (Foster et al., 2021). Recently, Tkachuk et al. (2024) gave a statistically efficient learner for policy optimization, if in addition the data is assumed to be given as trajectories. In this work we present a statistically efficient learner for policy evaluation under the same assumptions. Further, we show that the sample complexity of the learner used by Tkachuk et al. (2024) for policy optimization can be improved by a tighter analysis.

Paper Structure

This paper contains 18 sections, 25 theorems, 117 equations, 1 figure, 1 table, 2 algorithms.

Key Result

Theorem 1

Let the linear $q^\pi$-realizability, trajectory data, and concentrability assumptions hold, and let the behavior policy $\pi^b$ be known. For any evaluation policy $\pi^e$, $\delta > 0$, and $\epsilon > 0$, if the number of trajectories is $n = \tilde{\Theta}(C_0^5 H^7 d^3/\epsilon^2 + L_\phi^2)$, then with probability at least $1 - \delta$
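To get a feel for how the trajectory requirement $n = \tilde{\Theta}(C_0^5 H^7 d^3/\epsilon^2 + L_\phi^2)$ scales, the following minimal sketch evaluates only the leading-order expression. It assumes all constants and logarithmic factors hidden by $\tilde{\Theta}$ are dropped, so the absolute numbers are not meaningful; only the ratios between settings are. The function name `trajectory_order` and the parameter values are hypothetical and chosen purely for illustration.

```python
import math


def trajectory_order(c0, horizon, dim, eps, l_phi):
    """Leading-order term C0^5 * H^7 * d^3 / eps^2 + L_phi^2.

    Assumption: constants and log factors hidden by the tilde-Theta
    notation are ignored, so this is a scaling guide, not a sample size.
    """
    values = (c0, horizon, dim, eps, l_phi)
    if not all(math.isfinite(v) for v in values):
        raise ValueError("all arguments must be finite")
    if eps <= 0:
        raise ValueError("eps must be positive")
    return c0**5 * horizon**7 * dim**3 / eps**2 + l_phi**2


# Hypothetical setting: C0 = 2, H = 10, d = 5, eps = 0.1, L_phi = 0.
base = trajectory_order(2.0, 10, 5, 0.1, 0.0)

# Halving eps multiplies the leading term by about 4 (1/eps^2 dependence).
ratio_eps = trajectory_order(2.0, 10, 5, 0.05, 0.0) / base

# Doubling the horizon multiplies it by about 2^7 = 128 (H^7 dependence).
ratio_h = trajectory_order(2.0, 20, 5, 0.1, 0.0) / base

# Doubling the concentrability coefficient gives about 2^5 = 32 (C0^5).
ratio_c0 = trajectory_order(4.0, 10, 5, 0.1, 0.0) / base

print(ratio_eps, ratio_h, ratio_c0)
```

The ratios show that the horizon $H$ has the strongest polynomial dependence, followed by the concentrability coefficient $C_0$ and then the feature dimension $d$.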

Figures (1)

  • Figure 1: The features for both MDPs are $\phi(s_1,\cdot)=(1), \phi(s_3,\cdot)=(0.5),\phi(\cdot,\cdot)=(0)$ otherwise. Left: Linearly $q^\pi$-realizable. Right: Linear MDP, due to skipping $s_1$ via the up action.

Theorems & Definitions (42)

  • Theorem 1: Policy Evaluation
  • Theorem 2: Policy Optimization
  • Lemma 1: Prop. 4.5 of Weisz et al. (2023)
  • Remark 1
  • Lemma 2: Linear $q^\pi$-realizability $\implies$ skippy Bellman completeness
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6: $\mathbf{Q}_\text{opt}$ guarantee
  • Lemma 7: $\mathbf{Q}_\text{eval}$ guarantee
  • ...and 32 more