Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs

Zihan Zhou; Honghao Wei; Lei Ying

TL;DR

This work tackles best policy identification in online CMDPs with a model-free, PAC framework. It introduces PRI, a three-phase algorithm exploiting limited stochasticity: prune irrelevant actions, refine a mixture of greedy policies, and identify a single policy from the occupancy measure. In well-separated CMDPs, PRI achieves $\tilde{\mathcal{O}}(H\sqrt{K})$ regret and zero constraint violation, independent of $S$ and $A$ in the main terms, and provides a matching $\Omega(H\sqrt{K})$ lower bound up to polylog factors. Empirical results on synthetic CMDPs and grid-world corroborate substantial improvements over prior model-free methods and validate the practical viability of identifying near-optimal policies with few stochastic decisions.

Abstract

This paper considers the best policy identification (BPI) problem in online Constrained Markov Decision Processes (CMDPs). We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with high probability. Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any guarantee of convergence to an optimal policy; they provide only average performance guarantees when a policy is sampled uniformly at random from all previously used policies. In this paper, we develop a new algorithm, named Pruning-Refinement-Identification (PRI), based on a fundamental, previously proved structural property of CMDPs, which we call limited stochasticity. The property states that for a CMDP with $N$ constraints, there exists an optimal policy with at most $N$ stochastic decisions. The proposed algorithm first identifies at which step and in which state a stochastic decision has to be taken and then fine-tunes the distributions of these stochastic decisions. PRI achieves three objectives: (i) it is a model-free algorithm; (ii) it outputs an approximately optimal policy with high probability at the end of learning; and (iii) it guarantees $\tilde{\mathcal{O}}(H\sqrt{K})$ regret and constraint violation, which significantly improves the best existing regret bound for a model-free algorithm, $\tilde{\mathcal{O}}(H^4 \sqrt{SA}K^{\frac{4}{5}})$, where $H$ is the length of each episode, $S$ is the number of states, $A$ is the number of actions, and the total number of episodes during learning is $2K+\tilde{\mathcal{O}}(K^{0.25})$. We further present a matching lower bound via an example showing that, under any online learning algorithm, there exists a well-separated CMDP instance such that either the regret or the violation has to be $\Omega(H\sqrt{K})$, which matches the upper bound up to a polylogarithmic factor.
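To give a feel for the gap between the two regret bounds quoted in the abstract, the following minimal sketch evaluates only their leading terms, with constants and polylogarithmic factors dropped. The problem sizes `H`, `S`, `A` and the values of `K` are hypothetical and chosen purely for illustration; the numbers say nothing about measured performance.

```python
import math

def pri_leading_term(H, K):
    """Leading term H * sqrt(K) of the PRI bound (constants and logs dropped)."""
    return H * math.sqrt(K)

def prior_leading_term(H, S, A, K):
    """Leading term H^4 * sqrt(S*A) * K^(4/5) of the earlier model-free bound."""
    return H**4 * math.sqrt(S * A) * K**0.8

# Hypothetical problem sizes, not taken from the paper's experiments.
H, S, A = 10, 20, 5

ratios = {}
for K in (10**4, 10**6, 10**8):
    # Algebraically the ratio is H^3 * sqrt(S*A) * K^(0.3), so it grows with K.
    ratios[K] = prior_leading_term(H, S, A, K) / pri_leading_term(H, K)
    print(f"K={K:>9}: prior / PRI leading-term ratio = {ratios[K]:.3e}")
```

Since the ratio scales as $H^3\sqrt{SA}\,K^{0.3}$, the gap widens as the number of episodes grows.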

Paper Structure (22 sections, 11 theorems, 82 equations, 4 figures, 6 tables, 5 algorithms)

Introduction
Related Work
Problem Formulation
PRI (Pruning-Refinement-Identification)
Analysis
Experiments
Synthetic CMDP
Grid-world
Conclusions
NOTATION TABLE
Review of Triple-Q
Proofs of the Technical Lemmas
Proof of Lemma 4.1 (Limited Stochasticity)
Proof of Lemma 4.3 (Decomposition)
Proof of Theorem \ref{the:multi} (Policy Pruning)
...and 7 more sections

Key Result

Lemma 4.1

If $q^*=\{q^*_{h}(x,a)\}_{h,x,a}$ is an optimal solution to the CMDP problem \eqref{eq:cmdp-offline}-\eqref{eq:proba} and is an extreme point, then there are at most $HS+N$ nonzero values in $q^*$. This implies that the optimal policy derived from $q^*$ includes at most $N$ stochastic decisions.
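The counting behind the lemma can be illustrated with a small sketch. It assumes the usual occupancy-measure convention, in which the policy is recovered as $\pi_h(a \mid x) = q_h(x,a) / \sum_{a'} q_h(x,a')$, and it makes the simplifying assumption that every step-state pair $(h,x)$ is visited with positive probability. Each pair then accounts for at least one nonzero entry, so at most $N$ entries are left over to make a decision stochastic. The occupancy values below are hand-made for illustration (they are not derived from any particular transition kernel), and all function names are hypothetical.

```python
def policy_from_occupancy(q, tol=1e-12):
    """Turn q[h][x][a] into pi[h][x][a] by normalising over actions at each (h, x)."""
    pi = []
    for q_h in q:
        pi_h = []
        for q_hx in q_h:
            total = sum(q_hx)
            if total > tol:
                pi_h.append([v / total for v in q_hx])
            else:
                # Unvisited (h, x): the choice is arbitrary, default to action 0.
                pi_h.append([1.0] + [0.0] * (len(q_hx) - 1))
        pi.append(pi_h)
    return pi

def count_nonzeros(q, tol=1e-12):
    """Number of entries q_h(x, a) that are strictly positive."""
    return sum(1 for q_h in q for q_hx in q_h for v in q_hx if v > tol)

def count_stochastic_decisions(q, tol=1e-12):
    """Number of (h, x) pairs where more than one action has positive mass."""
    return sum(
        1
        for q_h in q
        for q_hx in q_h
        if sum(1 for v in q_hx if v > tol) > 1
    )

# Hand-made example with H = 2 steps, S = 2 states, A = 2 actions, N = 1 constraint.
H, S, N = 2, 2, 1
q = [
    [[0.5, 0.0], [0.2, 0.3]],  # step 0: state 1 mixes two actions
    [[0.6, 0.0], [0.0, 0.4]],  # step 1: both states act deterministically
]

pi = policy_from_occupancy(q)
print("nonzero entries:", count_nonzeros(q), "(bound H*S + N =", H * S + N, ")")
print("stochastic decisions:", count_stochastic_decisions(q), "(bound N =", N, ")")
print("policy at step 0, state 1:", pi[0][1])
```

In this example the single stochastic decision sits at step 0, state 1, which is the kind of location the abstract describes PRI as first identifying and then fine-tuning.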

Figures (4)

Figure 1: Results for a synthetic CMDP with a unique solution, the shaded region represents the 95% confidence interval.
Figure 2: Results for the grid world environment, the shaded region represents the 95% confidence interval.
Figure 3: Grid World
Figure 4: The policy identified by PRI in the Grid World

Theorems & Definitions (18)

Lemma 4.1: Limited Stochasticity
Corollary 4.2
Lemma 4.3: Decomposition
Theorem 4.4
Theorem 4.5
Theorem 5.1
Theorem 5.2: Refinement
Theorem 5.3: Identification
Lemma 5.4
Proof of Theorem \ref{thm:main}
...and 8 more
