Offline Reinforcement Learning via Linear-Programming with Error-Bound Induced Constraints

Asuman Ozdaglar, Sarath Pattathil, Jiawei Zhang, Kaiqing Zhang

TL;DR

This work studies offline RL through a constrained LP framework that augments classical LP formulations with error-bound induced constraints. By relating occupancy-validity violations to value suboptimality via novel error bounds, the authors derive both dual and primal-dual reformulations that admit tractable solutions with general function approximation. Under a completeness-type assumption, they obtain the optimal $O(1/\sqrt{n})$ rate under single-policy concentrability, including tabular improvements and extensions to average-reward MDPs. They further show a realizability-only, gap-dependent $O(1/\sqrt{n})$ rate by adding a lower-bound constraint on the density ratio and leveraging primal-gap analysis, enabling broad applicability with simple analyses. The unified, constraint-based LP framework thus relaxes key standard assumptions while delivering strong, practically relevant sample-complexity guarantees for offline RL across discounted, average-reward, and contextual-bandit settings.

Abstract

Offline reinforcement learning (RL) aims to find an optimal policy for Markov decision processes (MDPs) using a pre-collected dataset. In this work, we revisit the linear programming (LP) reformulation of MDPs for offline RL, with the goal of developing algorithms that achieve the optimal $O(1/\sqrt{n})$ sample complexity, where $n$ is the sample size, under partial data coverage and general function approximation, while remaining computationally tractable. To this end, we derive new error bounds for both the dual and primal-dual formulations of the LP, and incorporate them as constraints in the LP reformulation. We then show that, under a completeness-type assumption, $O(1/\sqrt{n})$ sample complexity can be achieved under the standard single-policy coverage assumption when one properly relaxes the occupancy validity constraint in the LP. This framework readily handles both infinite-horizon discounted and average-reward MDPs, in both the general function approximation and tabular cases. The instantiation to the tabular case achieves either state-of-the-art or the first sample complexities of offline RL in these settings. To further remove any completeness-type assumption, we then introduce a lower-bound constraint in the LP, together with a variant of the standard single-policy coverage assumption. The resulting algorithm achieves an $O(1/\sqrt{n})$ sample complexity that depends on the value-function gap, under realizability assumptions only. Our properly constrained LP framework advances existing results in several respects: it relaxes certain assumptions and achieves the optimal $O(1/\sqrt{n})$ sample complexity, with simple analyses. We hope our results bring new insights into the use of LP formulations, and the equivalent primal-dual minimax optimization, for offline RL through error-bound induced constraints.
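
For orientation, the classical LP pair for a tabular $\gamma$-discounted MDP, which the abstract takes as its starting point, can be sketched as below. This is the textbook formulation, not the paper's constrained variant. The symbols $\mu_0$ (initial state distribution), $P$ (transition kernel), $v$ (value function), and the $(1-\gamma)$ normalization are notational assumptions made here for illustration; $\theta$ is used for the occupancy measure and $r$ for the reward vector, which appears to match the usage in Lemma 1.

```latex
% Primal LP over value functions v (assumed notation)
\min_{v}\ (1-\gamma)\,\mu_0^\top v
\quad \text{s.t.}\quad
v(s) \ge r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v(s') \quad \forall (s,a).

% Dual LP over occupancy measures \theta (assumed notation)
\max_{\theta \ge 0}\ r^\top \theta
\quad \text{s.t.}\quad
\sum_{a} \theta(s,a) = (1-\gamma)\,\mu_0(s)
  + \gamma \sum_{s',a'} P(s \mid s',a')\,\theta(s',a') \quad \forall s.
```

The equality constraint in the dual is the Bellman flow condition, usually read as the requirement that $\theta$ be a valid occupancy measure. It is presumably the "occupancy validity constraint" that the abstract proposes to relax, since in the offline setting it can only be estimated from samples.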

Paper Structure

This paper contains 53 sections, 30 theorems, 100 equations.

Key Result

Lemma 1

We have $u^\top w = r^{\top} \theta$ and $K w = M\theta.$

Theorems & Definitions (49)

  • Lemma 1
  • Definition 1: Completeness
  • Lemma 2
  • Lemma 3
  • Definition 2: Sign Function
  • Remark 1
  • Theorem 1
  • Remark 2
  • Theorem 2
  • Proposition 1
  • ...and 39 more