Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism

Kihyun Yu, Duksang Lee, William Overman, Dabeen Lee

TL;DR

A model-based algorithm is proposed whose novel cost and reward function estimators provide tighter cost pessimism and reward optimism, improving upon the best-known regret upper bound.

Abstract

This paper studies the safe reinforcement learning problem formulated as an episodic finite-horizon tabular constrained Markov decision process with an unknown transition kernel and stochastic reward and cost functions. We propose a model-based algorithm based on novel cost and reward function estimators that provide tighter cost pessimism and reward optimism. While guaranteeing no constraint violation in every episode, our algorithm achieves a regret upper bound of $\widetilde{\mathcal{O}}((\bar C - \bar C_b)^{-1}H^{2.5} S\sqrt{AK})$, where $\bar C$ is the cost budget for an episode, $\bar C_b$ is the expected cost under a safe baseline policy over an episode, $H$ is the horizon, and $S$, $A$, and $K$ are the number of states, actions, and episodes, respectively. This improves upon the best-known regret upper bound, and when $\bar C - \bar C_b = \Omega(H)$, it nearly matches the regret lower bound of $\Omega(H^{1.5}\sqrt{SAK})$. We deduce our cost and reward function estimators via a Bellman-type law of total variance to obtain tight bounds on the expected sum of the variances of value function estimates. This leads to a tighter dependence on the horizon in the function estimators. We also present numerical results to demonstrate the computational effectiveness of our proposed framework.
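
To make the "nearly matches" claim concrete, the following is a short worked comparison rather than a statement taken from the paper. It assumes $\bar C - \bar C_b \ge cH$ for some constant $c > 0$ (the constant $c$ is introduced here only for illustration) and ignores the logarithmic factors hidden by $\widetilde{\mathcal{O}}$:

$$
(\bar C - \bar C_b)^{-1} H^{2.5} S\sqrt{AK} \;\le\; \frac{1}{c}\, H^{1.5} S\sqrt{AK},
\qquad
\frac{H^{1.5} S\sqrt{AK}}{H^{1.5}\sqrt{SAK}} \;=\; \sqrt{S}.
$$

Under this assumption, the upper bound has the same dependence on $H$, $A$, and $K$ as the lower bound, and the remaining gap is a factor of $\sqrt{S}$ up to constants and logarithmic terms.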

Paper Structure

This paper contains 28 sections, 29 theorems, 225 equations, 2 figures, 2 tables, 1 algorithm.

Key Result

Lemma 2.1

With probability at least $1-4\delta$, it holds that for any $(s,a,h)\in \mathcal{S}\times\mathcal{A}\times[H]$ and $k\in[K]$,

Figures (2)

  • Figure 1: Comparison of DOPE+ and DOPE: Regret (Left) and Hard Constraint Violation (Right)
  • Figure 2: Transition Probability for Taking $a_1$ and $a_2$ at Each State: Taking $a_1$ (Left) and Taking $a_2$ (Right)

Theorems & Definitions (31)

  • Lemma 2.1
  • Lemma 3.1
  • Lemma 3.2
  • Theorem 1
  • Remark 1
  • Lemma 3.3
  • Lemma 4.1
  • Theorem 2
  • Theorem 3
  • Remark 2
  • ...and 21 more