Offline RL via Feature-Occupancy Gradient Ascent

Gergely Neu, Nneka Okolo

TL;DR

This work tackles offline reinforcement learning in infinite-horizon discounted linear MDPs with a known feature map of dimension $d$. It introduces Feature-Occupancy Gradient Ascent (FOGAS), a primal-only gradient-ascent algorithm in the space of feature occupancies that uses a least-squares estimator for the transition operator and a softmax policy update, together with a stabilization trick that avoids explicit coverage bounds. The authors prove a high-probability, comparator-adaptive bound on suboptimality that depends on the data-coverage term $\|\bm{\lambda}^*\|_{\bm{\Lambda}_n^{-1}}$ and on the sample size $n$, without requiring prior knowledge of the coverage ratio. The method also avoids the expensive inner loops typical of earlier LP-based approaches. The authors discuss computational and statistical efficiency, show that the bounds hold under the weakest known data-coverage assumptions, and outline extensions to undiscounted or constrained MDPs and to broader function-approximation settings.
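The softmax policy update mentioned above can be illustrated with a minimal sketch. The summary does not state the exact form of the update, so the version below is an assumption: the policy at a single state is taken proportional to the exponential of a step size `alpha` times a running sum of action-value estimates. The function name `softmax_policy` and all numbers are hypothetical and chosen only for illustration.

```python
import math

def softmax_policy(q_sum, alpha):
    """Return pi(a) proportional to exp(alpha * q_sum[a]) for one state.

    q_sum[a] is an assumed running sum of action-value estimates for action a.
    Subtracting the maximum before exponentiating avoids overflow.
    """
    m = max(alpha * q for q in q_sum)
    weights = [math.exp(alpha * q - m) for q in q_sum]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical example with 3 actions: zero accumulated values give the uniform policy.
uniform = softmax_policy([0.0, 0.0, 0.0], alpha=0.5)

# After accumulating estimates that favour action 2, probability mass shifts towards it.
updated = softmax_policy([1.0, 0.0, 3.0], alpha=0.5)
```

Starting from all-zero accumulated values reproduces a uniform initial policy, which matches the initialization $\pi_1$ used in the key result.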

Abstract

We study offline Reinforcement Learning in large infinite-horizon discounted Markov Decision Processes (MDPs) when the reward and transition models are linearly realizable under a known feature map. Starting from the classic linear-program formulation of the optimal control problem in MDPs, we develop a new algorithm that performs a form of gradient ascent in the space of feature occupancies, defined as the expected feature vectors that can potentially be generated by executing policies in the environment. We show that the resulting simple algorithm satisfies strong computational and sample complexity guarantees, achieved under the least restrictive data coverage assumptions known in the literature. In particular, we show that the sample complexity of our method scales optimally with the desired accuracy level and depends on a weak notion of coverage that only requires the empirical feature covariance matrix to cover a single direction in the feature space (as opposed to covering a full subspace). Additionally, our method is easy to implement and requires no prior knowledge of the coverage ratio (or even an upper bound on it), which altogether make it the strongest known algorithm for this setting to date.
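To make the "single direction" notion of coverage concrete, the sketch below evaluates a coverage term of the form $\|\bm{\lambda}\|_{\bm{\Lambda}^{-1}} = \sqrt{\bm{\lambda}^\top \bm{\Lambda}^{-1} \bm{\lambda}}$ in a two-dimensional feature space. It is an illustration under stated assumptions, not the paper's implementation: the covariance is assumed to be a regularized empirical average `reg * I + (1/n) * sum(phi phi^T)`, and the dataset, the regularizer `reg`, and the two candidate feature occupancies are all hypothetical.

```python
import math

def empirical_covariance(features, reg):
    """Assumed regularized 2x2 covariance: reg * I + (1/n) * sum of phi phi^T."""
    n = len(features)
    a = reg + sum(p[0] * p[0] for p in features) / n
    b = sum(p[0] * p[1] for p in features) / n
    c = reg + sum(p[1] * p[1] for p in features) / n
    return a, b, c  # entries of the symmetric matrix [[a, b], [b, c]]

def coverage_norm(lam, cov):
    """Compute sqrt(lam^T cov^{-1} lam) using the closed-form 2x2 inverse."""
    a, b, c = cov
    det = a * c - b * b
    quad = (c * lam[0] ** 2 - 2 * b * lam[0] * lam[1] + a * lam[1] ** 2) / det
    return math.sqrt(quad)

# Hypothetical dataset whose features only ever point along the first axis.
data = [(1.0, 0.0)] * 100
cov = empirical_covariance(data, reg=0.01)

# A comparator whose feature occupancy lies in the covered direction has a small norm,
# while one pointing in the uncovered direction has a large norm.
covered = coverage_norm((1.0, 0.0), cov)
uncovered = coverage_norm((0.0, 1.0), cov)
```

In this toy setting the data covers only one direction, yet the coverage term stays small for a comparator aligned with that direction; no full subspace needs to be covered.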

Paper Structure

This paper contains 30 sections, 19 theorems, 115 equations, 1 algorithm.

Key Result

Theorem 3.1

Let $\pi_{1}$ be the uniform policy and $\bm{\lambda}_{1}=\bm{0}$. Also set $D_{\bm{\theta}} = \sqrt{d}/(1-\gamma)$, $D_{\pi} = \alpha T D_{\bm{\theta}}$ and $\delta > 0$. Suppose that we run FOGAS for $T \ge \frac{2 R^2 n \log A}{\log(1/\delta)}$ rounds with parameters $\beta$ ... Then, with probability at least $1-\delta$, the following bound is satisfied for any comparator policy ...
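The quantities that are fully visible in this excerpt can be evaluated directly, as in the sketch below. It only computes $D_{\bm{\theta}}$, the smallest integer $T$ satisfying the stated lower bound, and $D_{\pi}$ for that $T$. Several things are assumptions: $A$ is read as the number of actions, $R$ is treated as a given constant because the excerpt does not define it, and the value of $\alpha$ is hypothetical because the parameter choices are truncated above. The function name `fogas_constants` is illustrative.

```python
import math

def fogas_constants(d, gamma, alpha, n, R, num_actions, delta):
    """Evaluate the constants visible in the theorem excerpt.

    alpha and R are placeholders: their definitions are not part of the excerpt.
    """
    D_theta = math.sqrt(d) / (1.0 - gamma)
    # Lower bound on the number of rounds: T >= 2 R^2 n log(A) / log(1/delta).
    T_bound = 2.0 * R ** 2 * n * math.log(num_actions) / math.log(1.0 / delta)
    T_min = math.ceil(T_bound)
    # D_pi depends on the number of rounds actually run; T_min is used here.
    D_pi = alpha * T_min * D_theta
    return D_theta, T_bound, T_min, D_pi

# Hypothetical inputs chosen only for illustration.
D_theta, T_bound, T_min, D_pi = fogas_constants(
    d=4, gamma=0.5, alpha=0.1, n=100, R=1.0, num_actions=4, delta=0.05
)
```

Note that the lower bound on $T$ grows linearly with the sample size $n$, so larger datasets call for proportionally more rounds under this setting.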

Theorems & Definitions (25)

  • Definition 2.1: Linear MDP
  • Theorem 3.1
  • Lemma 4.1
  • Lemma 4.2
  • Lemma 4.3
  • Lemma 4.4
  • Lemma 4.5
  • Lemma 4.6
  • Lemma A.1
  • proof
  • ...and 15 more