A Primal-Dual Algorithm for Offline Constrained Reinforcement Learning with Linear MDPs

Kihyuk Hong; Ambuj Tewari

TL;DR

The paper tackles offline constrained reinforcement learning in the infinite-horizon discounted setting for linear MDPs, addressing distribution shift via partial/feature data coverage. It introduces a computationally efficient primal-dual algorithm that operates on low-dimensional feature-space quantities and uses a four-player regret analysis to achieve $O(\epsilon^{-2})$ sample complexity under partial coverage, improving upon prior $O(\epsilon^{-4})$ results. A feature-coverage variant is developed, and the method extends to offline CMDPs with multiple constraints under Slater conditions, maintaining the favorable sample complexity. The work is significant for safety-critical, data-limited RL tasks, enabling efficient policy learning with provable near-optimality guarantees under realistic data-coverage assumptions.

Abstract

We study offline reinforcement learning (RL) with linear MDPs under the infinite-horizon discounted setting, which aims to learn a policy that maximizes the expected discounted cumulative reward using a pre-collected dataset. Existing algorithms for this setting either require uniform data coverage assumptions or are computationally inefficient for finding an $\epsilon$-optimal policy with $O(\epsilon^{-2})$ sample complexity. In this paper, we propose a primal-dual algorithm for offline RL with linear MDPs in the infinite-horizon discounted setting. Our algorithm is the first computationally efficient algorithm in this setting that achieves sample complexity of $O(\epsilon^{-2})$ under a partial data coverage assumption. Our work improves upon a recent work that requires $O(\epsilon^{-4})$ samples. Moreover, we extend our algorithm to the offline constrained RL setting, which enforces constraints on additional reward signals.
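
For orientation, the constrained setting mentioned at the end of the abstract is commonly written as a constrained MDP together with its Lagrangian. The sketch below uses the standard textbook form; the symbols $J_i$, $\tau_i$, $w_i$, and $I$ are notational assumptions introduced here and are not taken from the paper.

```latex
% Generic constrained MDP (assumed notation): J_0 is the discounted return of
% the main reward, J_1, ..., J_I are the returns of the additional reward signals.
\max_{\pi} \; J_0(\pi)
\quad \text{subject to} \quad
J_i(\pi) \ge \tau_i, \qquad i = 1, \dots, I,

% Lagrangian with nonnegative multipliers w_i, one per constraint.
L(\pi, w) = J_0(\pi) + \sum_{i=1}^{I} w_i \bigl( J_i(\pi) - \tau_i \bigr),
\qquad w_i \ge 0 .
```

A primal-dual method in this generic form alternates between improving $\pi$ against $L$ and adjusting the multipliers $w$. As a rough sense of scale for the stated rates, ignoring constants and other problem parameters, at $\epsilon = 0.1$ we have $\epsilon^{-2} = 10^{2}$ versus $\epsilon^{-4} = 10^{4}$.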

Paper Structure (45 sections, 27 theorems, 169 equations, 1 table, 2 algorithms)

Introduction
Related Work
Offline RL with General Function Approximation
Offline RL with Episodic Setting
Preliminaries
Notations
Linear MDP
Offline Learning and Data Coverage
Algorithm Design
Analysis
Bounding Regret of $\pi$-player
Bounding Regret of $\zeta$-player
Bounding Regret of $\lambda$-player
Algorithm and Main Results
Result on Feature Coverage Assumptions
...and 30 more sections

Key Result

Lemma 1

For a fixed $\bm\lambda(\bm{c}) = \frac{1}{n} \sum_{k = 1}^n c_k \bm\varphi(s_k, a_k)$ with $\vert c_k \vert \leq B$ for $k = 1, \dots, n$, and a policy $\pi$, we have with probability at least $1 - \delta$ conditional on the data of state-action pairs $\{ (s_k, a_k) \}_{k = 1}^n$.
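
The parameterization in Lemma 1 can be made concrete with a small numerical sketch. The code below only evaluates $\bm\lambda(\bm{c})$ from a feature matrix and a coefficient vector; the function name `lambda_from_coefficients` and the toy feature values are hypothetical and chosen for illustration, not taken from the paper.

```python
import numpy as np


def lambda_from_coefficients(features, coeffs, bound):
    """Evaluate lambda(c) = (1/n) * sum_k c_k * phi(s_k, a_k).

    features: array of shape (n, d), row k is phi(s_k, a_k).
    coeffs:   array of shape (n,), the coefficients c_k.
    bound:    the constant B; every |c_k| must be at most B.
    """
    features = np.asarray(features, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    if features.ndim != 2 or coeffs.shape != (features.shape[0],):
        raise ValueError("expected features of shape (n, d) and coeffs of shape (n,)")
    if np.any(np.abs(coeffs) > bound):
        raise ValueError("every coefficient must satisfy |c_k| <= B")
    n = features.shape[0]
    # Weighted average of the feature vectors observed in the dataset.
    return features.T @ coeffs / n


# Toy data (hypothetical): n = 3 samples, d = 2 features, each row has norm 1.
phi = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
c = np.array([2.0, -1.0, 0.5])
B = 2.0
lam = lambda_from_coefficients(phi, c, B)
```

Because $\bm\lambda(\bm{c})$ is an average of feature vectors scaled by at most $B$, the triangle inequality gives $\Vert \bm\lambda(\bm{c}) \Vert_2 \leq B$ whenever every feature vector has norm at most 1, which is the case for the toy rows above.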

Theorems & Definitions (46)

Lemma 1
Lemma 2: Barycentric spanner
Definition 3
Definition 4
Lemma 5
Lemma 6
Lemma 7
Theorem 8
Theorem 9
Lemma 10: Covering balls (see, e.g., Wainwright, 2019)
...and 36 more
