Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs

Zihan Zhou, Honghao Wei, Lei Ying

TL;DR

This work tackles best policy identification in online CMDPs with a model-free, PAC framework. It introduces PRI, a three-phase algorithm exploiting limited stochasticity: prune irrelevant actions, refine a mixture of greedy policies, and identify a single policy from the occupancy measure. In well-separated CMDPs, PRI achieves $\tilde{\mathcal{O}}(H\sqrt{K})$ regret and zero constraint violation, independent of $S$ and $A$ in the main terms, and provides a matching $\Omega(H\sqrt{K})$ lower bound up to polylog factors. Empirical results on synthetic CMDPs and grid-world corroborate substantial improvements over prior model-free methods and validate the practical viability of identifying near-optimal policies with few stochastic decisions.

Abstract

This paper considers the best policy identification (BPI) problem in online Constrained Markov Decision Processes (CMDPs). We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with high probability. Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy; they provide only average performance guarantees when a policy is sampled uniformly at random from all previously used policies. In this paper, we develop a new algorithm, named Pruning-Refinement-Identification (PRI), based on a previously established fundamental structural property of CMDPs, which we call limited stochasticity. The property states that for a CMDP with $N$ constraints, there exists an optimal policy with at most $N$ stochastic decisions. The proposed algorithm first identifies at which step and in which state a stochastic decision has to be taken and then fine-tunes the distributions of these stochastic decisions. PRI achieves three objectives: (i) PRI is a model-free algorithm; (ii) it outputs an approximately optimal policy with high probability at the end of learning; and (iii) it guarantees $\tilde{\mathcal{O}}(H\sqrt{K})$ regret and constraint violation, which significantly improves the best existing regret bound of $\tilde{\mathcal{O}}(H^4 \sqrt{SA}K^{\frac{4}{5}})$ for a model-free algorithm, where $H$ is the length of each episode, $S$ is the number of states, $A$ is the number of actions, and the total number of episodes during learning is $2K+\tilde{\mathcal{O}}(K^{0.25})$. We further present a matching lower bound via an example showing that, under any online learning algorithm, there exists a well-separated CMDP instance such that either the regret or the violation has to be $\Omega(H\sqrt{K})$, which matches the upper bound up to a polylogarithmic factor.
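
To make the size of the improvement concrete, one can divide the earlier bound by the new one. This ignores the constants and polylogarithmic factors hidden in the $\tilde{\mathcal{O}}$ notation, so it is an order-of-magnitude comparison rather than an exact statement:

$$\frac{H^4\sqrt{SA}\,K^{4/5}}{H\sqrt{K}} \;=\; H^3\sqrt{SA}\,K^{\frac{4}{5}-\frac{1}{2}} \;=\; H^3\sqrt{SA}\,K^{3/10}.$$

Under this reading, the gap between the two bounds grows with the horizon $H$, with the sizes of the state and action spaces, and polynomially with the number of episodes $K$.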

Paper Structure

This paper contains 22 sections, 11 theorems, 82 equations, 4 figures, 6 tables, and 5 algorithms.

Key Result

Lemma 4.1

If $q^*=\{q^*_{h}(x,a)\}_{h,x,a}$ is an optimal solution to the CMDP problem (eq:cmdp-offline)-(eq:proba) and is an extreme point, then there are at most $HS+ N$ nonzero values in $q^*$. This implies that the optimal policy derived from $q^*$ includes at most $N$ stochastic decisions.
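
The counting argument behind this lemma can be checked on a small instance. The sketch below is illustrative only and is not the paper's code: the function name `policy_from_occupancy`, the nested-list layout `q[h][x][a]`, and the numbers in `Q_EXAMPLE` are all assumptions made for this example. It derives a policy from an occupancy measure by the standard normalization $\pi_h(a \mid x) = q_h(x,a) / \sum_{a'} q_h(x,a')$, counts the nonzero entries, and lists the (step, state) pairs where more than one action has positive weight.

```python
from typing import List, Tuple

# Assumed layout: q[h][x][a] is the occupancy of action a in state x at step h.
Occupancy = List[List[List[float]]]


def policy_from_occupancy(
    q: Occupancy, tol: float = 1e-12
) -> Tuple[Occupancy, int, List[Tuple[int, int]]]:
    """Return (policy, number of nonzero entries, stochastic (h, x) pairs)."""
    policy: Occupancy = []
    nonzeros = 0
    stochastic: List[Tuple[int, int]] = []
    for h, q_h in enumerate(q):
        pi_h = []
        for x, q_hx in enumerate(q_h):
            total = sum(q_hx)
            support = [a for a, v in enumerate(q_hx) if v > tol]
            nonzeros += len(support)
            if total > tol:
                # Normalize the occupancy at (h, x) into action probabilities.
                pi_h.append([v / total for v in q_hx])
            else:
                # Unvisited (h, x): the policy is unconstrained here; a uniform
                # choice is an arbitrary convention for this sketch.
                pi_h.append([1.0 / len(q_hx)] * len(q_hx))
            if len(support) > 1:
                stochastic.append((h, x))
        policy.append(pi_h)
    return policy, nonzeros, stochastic


# Hypothetical occupancy measure with H = 2 steps, S = 2 states, A = 2 actions.
# It has HS + N = 4 + 1 = 5 nonzero entries, matching the lemma with N = 1.
Q_EXAMPLE: Occupancy = [
    [[0.5, 0.0], [0.2, 0.3]],  # step 0: state 1 mixes two actions
    [[0.0, 0.6], [0.4, 0.0]],  # step 1: both states are deterministic
]

policy, nonzeros, stochastic = policy_from_occupancy(Q_EXAMPLE)
```

In this instance every (step, state) pair is visited, so $HS = 4$ of the nonzero entries are used up by one action per pair, and the single remaining nonzero entry yields exactly one stochastic decision, at step 0 in state 1.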

Figures (4)

  • Figure 1: Results for a synthetic CMDP with a unique solution. The shaded region represents the 95% confidence interval.
  • Figure 2: Results for the grid-world environment. The shaded region represents the 95% confidence interval.
  • Figure 3: Grid World
  • Figure 4: The policy identified by PRI in the Grid World

Theorems & Definitions (18)

  • Lemma 4.1: Limited Stochasticity
  • Corollary 4.2
  • Lemma 4.3: Decomposition
  • Theorem 4.4
  • Theorem 4.5
  • Theorem 5.1
  • Theorem 5.2: Refinement
  • Theorem 5.3: Identification
  • Lemma 5.4
  • Proof of Theorem \ref{thm:main}
  • ...and 8 more