Offline Reinforcement Learning via Linear-Programming with Error-Bound Induced Constraints
Asuman Ozdaglar, Sarath Pattathil, Jiawei Zhang, Kaiqing Zhang
TL;DR
This work studies offline RL through a constrained LP framework that augments classical LP formulations with error-bound induced constraints. By relating occupancy-validity violations to value suboptimality via novel error bounds, the authors derive both dual and primal-dual reformulations that admit tractable solutions with general function approximation. Under a completeness-type assumption, they obtain the optimal $O(1/\sqrt{n})$ rate under single-policy concentrability, including tabular improvements and extensions to average-reward MDPs. They further show a realizability-only, gap-dependent $O(1/\sqrt{n})$ rate by adding a lower-bound constraint on the density ratio and leveraging primal-gap analysis, enabling broad applicability with simple analyses. The unified, constraint-based LP framework thus relaxes key standard assumptions while delivering strong, practically relevant sample-complexity guarantees for offline RL across discounted, average-reward, and contextual-bandit settings.
Abstract
Offline reinforcement learning (RL) aims to find an optimal policy for Markov decision processes (MDPs) using a pre-collected dataset. In this work, we revisit the linear programming (LP) reformulation of Markov decision processes for offline RL, with the goal of developing algorithms with optimal $O(1/\sqrt{n})$ sample complexity, where $n$ is the sample size, under partial data coverage and general function approximation, and with favorable computational tractability. To this end, we derive new \emph{error bounds} for both the dual and primal-dual formulations of the LP, and incorporate them properly as \emph{constraints} in the LP reformulation. We then show that under a completeness-type assumption, $O(1/\sqrt{n})$ sample complexity can be achieved under standard single-policy coverage assumption, when one properly \emph{relaxes} the occupancy validity constraint in the LP. This framework can readily handle both infinite-horizon discounted and average-reward MDPs, in both general function approximation and tabular cases. The instantiation to the tabular case achieves either state-of-the-art or the first sample complexities of offline RL in these settings. To further remove any completeness-type assumption, we then introduce a proper \emph{lower-bound constraint} in the LP, and a variant of the standard single-policy coverage assumption. Such an algorithm leads to a $O(1/\sqrt{n})$ sample complexity with dependence on the \emph{value-function gap}, with only realizability assumptions. Our properly constrained LP framework advances the existing results in several aspects, in relaxing certain assumptions and achieving the optimal $O(1/\sqrt{n})$ sample complexity, with simple analyses. We hope our results bring new insights into the use of LP formulations and the equivalent primal-dual minimax optimization for offline RL, through the error-bound induced constraints.
