A Primal-Dual Algorithm for Offline Constrained Reinforcement Learning with Linear MDPs
Kihyuk Hong, Ambuj Tewari
TL;DR
The paper tackles offline constrained reinforcement learning in the infinite-horizon discounted setting for linear MDPs, addressing distribution shift under partial data coverage and feature coverage assumptions. It introduces a computationally efficient primal-dual algorithm that operates on low-dimensional feature-space quantities and uses a four-player regret analysis to achieve $O(\epsilon^{-2})$ sample complexity under partial coverage, improving upon the prior $O(\epsilon^{-4})$ result. A feature-coverage variant is also developed, and the method extends to offline CMDPs with multiple constraints under Slater's condition while retaining the $O(\epsilon^{-2})$ sample complexity. The work is relevant to safety-critical, data-limited RL tasks, where it enables efficient policy learning with provable near-optimality guarantees under realistic data-coverage assumptions.
Abstract
We study offline reinforcement learning (RL) with linear MDPs in the infinite-horizon discounted setting, where the goal is to learn a policy that maximizes the expected discounted cumulative reward using a pre-collected dataset. Existing algorithms for this setting either require a uniform data coverage assumption or are computationally inefficient for finding an $\epsilon$-optimal policy with $O(\epsilon^{-2})$ sample complexity. In this paper, we propose a primal-dual algorithm for offline RL with linear MDPs in the infinite-horizon discounted setting. Our algorithm is the first computationally efficient algorithm in this setting that achieves a sample complexity of $O(\epsilon^{-2})$ under a partial data coverage assumption. It improves upon a recent work that requires $O(\epsilon^{-4})$ samples. Moreover, we extend our algorithm to the offline constrained RL setting, which enforces constraints on additional reward signals.
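As a point of reference for the constrained setting, the following is a sketch of the standard discounted constrained RL formulation that primal-dual methods typically target. The notation ($J_i$, $r_i$, $\tau_i$, $\lambda_i$, $\gamma$, $I$) and the $(1-\gamma)$ normalization are assumptions made here for illustration and may differ from the paper's own definitions.

$$J_i(\pi) = (1-\gamma)\, \mathbb{E}^{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_i(s_t, a_t)\Big], \quad i = 0, 1, \dots, I$$

$$\max_{\pi} \; J_0(\pi) \quad \text{subject to} \quad J_i(\pi) \ge \tau_i, \quad i = 1, \dots, I$$

$$\max_{\pi} \, \min_{\lambda \ge 0} \; J_0(\pi) + \sum_{i=1}^{I} \lambda_i \big(J_i(\pi) - \tau_i\big)$$

Here $r_0$ is the primary reward, $r_1, \dots, r_I$ are the additional reward signals, and $\tau_i$ are their thresholds. A primal-dual method typically alternates updates of the policy and the multipliers $\lambda$ in search of an approximate saddle point of the Lagrangian in the last line.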
