Deterministic Policies for Constrained Reinforcement Learning in Polynomial Time

Jeremy McMahan

TL;DR

This work answers three open questions spanning two long-standing lines of research: polynomial-time approximability is possible for 1) anytime-constrained policies, 2) almost-sure-constrained policies, and 3) deterministic expectation-constrained policies.

Abstract

We present a novel algorithm that efficiently computes near-optimal deterministic policies for constrained reinforcement learning (CRL) problems. Our approach combines three key ideas: (1) value-demand augmentation, (2) action-space approximate dynamic programming, and (3) time-space rounding. Our algorithm constitutes a fully polynomial-time approximation scheme (FPTAS) for any time-space recursive (TSR) cost criterion. A TSR criterion requires the cost of a policy to be computable recursively over both time and (state) space; this class includes classical expectation, almost-sure, and anytime constraints. Our work answers three open questions spanning two long-standing lines of research: polynomial-time approximability is possible for 1) anytime-constrained policies, 2) almost-sure-constrained policies, and 3) deterministic expectation-constrained policies.

Paper Structure (89 sections, 24 theorems, 69 equations, 6 algorithms)

Key Result

Proposition 1

If $C$ is time-recursive (TR), then $C$ satisfies the usual optimality equations. Furthermore, $\mathop{\mathrm{arg\,min}}\limits_{\pi \in \Pi^D} C_M^{\pi}$ can be computed using backward induction in $O(HS^2 A)$ time, where $H$ is the horizon, $S$ the number of states, and $A$ the number of actions.
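To make the running-time claim concrete, here is a minimal sketch of backward induction for a cost criterion that can be computed recursively over time. The summary does not spell out the TR recursion, so the form used here (immediate cost plus an aggregate of next-step costs) is an assumption, as are the names `backward_induction`, `expectation`, `worst_case` and the nested-list layout of `cost` and `trans`. The two aggregates illustrate an expectation-style and a worst-case (almost-sure-style) criterion; they are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Aggregate = Callable[[Sequence[float], Sequence[float]], float]


def expectation(probs: Sequence[float], values: Sequence[float]) -> float:
    """Expected next-step cost (illustrative expectation criterion)."""
    return sum(p * v for p, v in zip(probs, values))


def worst_case(probs: Sequence[float], values: Sequence[float]) -> float:
    """Largest next-step cost over reachable states (illustrative
    worst-case criterion)."""
    return max(v for p, v in zip(probs, values) if p > 0)


def backward_induction(
    cost: List[List[List[float]]],          # assumed layout: cost[h][s][a]
    trans: List[List[List[List[float]]]],   # assumed layout: trans[h][s][a][s2]
    combine: Aggregate,
) -> Tuple[List[float], List[List[int]]]:
    """Return step-0 costs and a deterministic policy, policy[h][s] = action."""
    H, S, A = len(cost), len(cost[0]), len(cost[0][0])
    value = [0.0] * S                       # cost-to-go after the final step
    policy: List[List[int]] = []
    for h in reversed(range(H)):            # H steps
        new_value = [0.0] * S
        decision = [0] * S
        for s in range(S):                  # S states
            best_q = float("inf")
            for a in range(A):              # A actions, each aggregate is O(S)
                q = cost[h][s][a] + combine(trans[h][s][a], value)
                if q < best_q:
                    best_q, decision[s] = q, a
            new_value[s] = best_q
        value = new_value
        policy.append(decision)
    policy.reverse()                        # index the policy by time step
    return value, policy


# Tiny two-state, two-action, two-step example (made-up numbers).
step_cost = [[1.5, 0.0], [2.0, 2.0]]
step_trans = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.0, 1.0]]]
cost = [step_cost, step_cost]
trans = [step_trans, step_trans]

exp_value, exp_policy = backward_induction(cost, trans, expectation)
wc_value, wc_policy = backward_induction(cost, trans, worst_case)
```

The four nested loops (steps, states, actions, and the pass over successor states inside `combine`) give the $O(HS^2 A)$ operation count. In the toy example the two aggregates pick different first-step actions in state 0, which shows that the minimizing policy depends on the criterion even though the induction itself is unchanged.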

Theorems & Definitions (63)

  • Definition 1: TSR
  • Proposition 1: TR Intuition
  • Proposition 2: TSR examples
  • Remark 1: Extensions
  • Remark 2: Inapproximability
  • Proposition 3: Packing-Covering Reduction
  • Definition 2: Cover MDP
  • Theorem 1: Reduction
  • Remark 3: Execution
  • Definition 3
  • ...and 53 more