A Primal-Dual Algorithm for Offline Constrained Reinforcement Learning with Linear MDPs

Kihyuk Hong, Ambuj Tewari

TL;DR

The paper tackles offline constrained reinforcement learning in the infinite-horizon discounted setting for linear MDPs, addressing distribution shift via partial/feature data coverage. It introduces a computationally efficient primal-dual algorithm that operates on low-dimensional feature-space quantities and uses a four-player regret analysis to achieve $O(\epsilon^{-2})$ sample complexity under partial coverage, improving upon prior $O(\epsilon^{-4})$ results. A feature-coverage variant is developed, and the method extends to offline CMDPs with multiple constraints under Slater conditions, maintaining the favorable sample complexity. The work is significant for safety-critical, data-limited RL tasks, enabling efficient policy learning with provable near-optimality guarantees under realistic data-coverage assumptions.

Abstract

We study offline reinforcement learning (RL) with linear MDPs in the infinite-horizon discounted setting, which aims to learn a policy that maximizes the expected discounted cumulative reward using a pre-collected dataset. Existing algorithms for this setting either require a uniform data coverage assumption or are computationally inefficient for finding an $ε$-optimal policy with $O(ε^{-2})$ sample complexity. In this paper, we propose a primal-dual algorithm for offline RL with linear MDPs in the infinite-horizon discounted setting. Our algorithm is the first computationally efficient algorithm in this setting that achieves a sample complexity of $O(ε^{-2})$ under a partial data coverage assumption. Our work improves upon a recent work that requires $O(ε^{-4})$ samples. Moreover, we extend our algorithm to the offline constrained RL setting, which enforces constraints on additional reward signals.
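
For orientation, the constrained setting mentioned at the end of the abstract is commonly written in the following form. This is only a sketch of the standard constrained-MDP formulation: the symbols $J_i$, $r_i$, $\tau_i$, $I$ and $w_i$ are assumed notation and may differ from the paper's, including any $(1-\gamma)$ normalization.

$$
\max_{\pi} \; J_0(\pi) \quad \text{s.t.} \quad J_i(\pi) \ge \tau_i, \;\; i = 1, \dots, I, \qquad J_i(\pi) = \mathbb{E}^{\pi}\Big[ \sum_{t=0}^{\infty} \gamma^t r_i(s_t, a_t) \Big].
$$

A primal-dual method typically targets a saddle point of the associated Lagrangian, with one nonnegative multiplier per constraint:

$$
\max_{\pi} \; \min_{w \ge 0} \; L(\pi, w), \qquad L(\pi, w) = J_0(\pi) + \sum_{i=1}^{I} w_i \big( J_i(\pi) - \tau_i \big).
$$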

Paper Structure

This paper contains 45 sections, 27 theorems, 169 equations, 1 table, 2 algorithms.

Key Result

Lemma 1

For a fixed $\bm\lambda(\bm{c}) = \frac{1}{n} \sum_{k = 1}^n c_k \bm\varphi(s_k, a_k)$ with $\vert c_k \vert \leq B$ for $k = 1, \dots, n$, and a policy $\pi$, we have with probability at least $1 - \delta$ conditional on the data of state-action pairs $\{ (s_k, a_k) \}_{k = 1}^n$.
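
The quantity $\bm\lambda(\bm{c})$ in Lemma 1 is a coefficient-weighted average of the dataset's feature vectors. A minimal sketch of that computation follows; the function names, the one-hot feature map, and the toy data are all hypothetical and chosen only for illustration, not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def feature_average(
    coeffs: Sequence[float],
    pairs: Sequence[Tuple[int, int]],
    phi: Callable[[int, int], Sequence[float]],
    bound: float,
) -> List[float]:
    """Return lambda(c) = (1/n) * sum_k c_k * phi(s_k, a_k).

    Raises ValueError if the inputs are empty, mismatched in length,
    or if some |c_k| exceeds `bound` (the B in the lemma).
    """
    n = len(pairs)
    if n == 0 or len(coeffs) != n:
        raise ValueError("coeffs and pairs must be non-empty and equal in length")
    if any(abs(c) > bound for c in coeffs):
        raise ValueError("every coefficient must satisfy |c_k| <= bound")

    dim = len(phi(*pairs[0]))
    lam = [0.0] * dim
    for c, (s, a) in zip(coeffs, pairs):
        feat = phi(s, a)
        for j in range(dim):
            lam[j] += c * feat[j] / n
    return lam


def one_hot_phi(s: int, a: int) -> List[float]:
    """Hypothetical feature map: one-hot encoding over 2 states x 2 actions."""
    vec = [0.0] * 4
    vec[2 * s + a] = 1.0
    return vec


# Toy dataset of n = 4 state-action pairs with coefficients bounded by B = 3.
toy_pairs = [(0, 0), (0, 1), (0, 0), (1, 1)]
toy_coeffs = [1.0, -2.0, 3.0, 0.5]
toy_lambda = feature_average(toy_coeffs, toy_pairs, one_hot_phi, bound=3.0)
```

With one-hot features, each coordinate of `toy_lambda` is simply the sum of the coefficients attached to that state-action pair, divided by $n$. With general $d$-dimensional features the same loop yields a $d$-dimensional vector, which is what lets such quantities be handled in feature space rather than per state.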

Theorems & Definitions (46)

  • Lemma 1
  • Lemma 2: Barycentric spanner
  • Definition 3
  • Definition 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • Theorem 8
  • Theorem 9
  • Lemma 10: Covering balls (see, e.g., Wainwright, 2019)
  • ...and 36 more