Efficient Exploration in Average-Reward Constrained Reinforcement Learning: Achieving Near-Optimal Regret With Posterior Sampling

Danil Provodin, Maurits Kaptein, Mykola Pechenizkiy

TL;DR

The paper addresses constrained reinforcement learning in infinite-horizon average-reward CMDPs with unknown transitions. It introduces PSConRL, a posterior-sampling algorithm that plans on sampled CMDPs by solving occupancy-measure LPs and switches to an exploration strategy when feasibility is compromised. The main contributions are a per-cost-component Bayesian regret bound of $\tilde{O}(DS\sqrt{AT})$, a computationally tractable algorithm for communicating CMDPs, and empirical results showing better performance than existing methods. Together these give near-optimal regret guarantees under average-reward constraints with tractable computation.
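
The posterior-sampling step mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes independent Dirichlet posteriors over the transition rows of each state-action pair, which the summary does not specify, and the names `sample_transition_model`, `counts`, and `prior` are hypothetical. The planning step on the sampled CMDP (the occupancy-measure LP) is not shown.

```python
import random


def sample_transition_model(counts, prior=1.0, rng=None):
    """Draw one transition model from independent Dirichlet posteriors.

    counts[s][a][s2] is the number of observed transitions s --a--> s2.
    The symmetric Dirichlet prior (ASSUMPTION, not from the source) adds
    `prior` pseudo-counts to every next state.
    """
    rng = rng or random.Random()
    model = []
    for per_state in counts:
        rows = []
        for per_action in per_state:
            # A Dirichlet draw is a vector of independent Gamma draws,
            # normalized so that the entries sum to one.
            gammas = [rng.gammavariate(prior + n, 1.0) for n in per_action]
            total = sum(gammas)
            rows.append([g / total for g in gammas])
        model.append(rows)
    return model


if __name__ == "__main__":
    # Two states, two actions; toy transition counts for illustration only.
    toy_counts = [[[8, 2], [1, 1]], [[0, 5], [3, 3]]]
    sampled = sample_transition_model(toy_counts, rng=random.Random(0))
    for s, per_state in enumerate(sampled):
        for a, row in enumerate(per_state):
            print(f"P(. | s={s}, a={a}) = {[round(p, 3) for p in row]}")
```

State-action pairs with many observations yield samples close to the empirical frequencies, while rarely visited pairs yield widely varying samples, which is what drives exploration in posterior-sampling methods.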

Abstract

We present a new algorithm based on posterior sampling for learning in Constrained Markov Decision Processes (CMDPs) in the infinite-horizon undiscounted setting. The algorithm achieves near-optimal regret bounds and outperforms existing algorithms empirically. Our main theoretical result is a Bayesian regret bound of $\tilde{O}(DS\sqrt{AT})$ for each cost component, for any communicating CMDP with $S$ states, $A$ actions, and diameter $D$. This regret bound matches the lower bound in its dependence on the time horizon $T$ and is the best-known regret bound for communicating CMDPs achieved by a computationally tractable algorithm. Empirical results show that our posterior sampling algorithm outperforms the existing algorithms for constrained reinforcement learning.
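
As a reading aid for the bound above, dividing the cumulative regret by the horizon shows what it implies per step. This is an order-of-magnitude statement only, with logarithmic factors suppressed by the $\tilde{O}$ notation:

$$
\frac{\tilde{O}\left(DS\sqrt{AT}\right)}{T} \;=\; \tilde{O}\!\left(DS\sqrt{\frac{A}{T}}\right) \;\longrightarrow\; 0 \quad \text{as } T \to \infty .
$$

So the average regret per step vanishes at rate $1/\sqrt{T}$, and quadrupling $T$ roughly halves it.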

Paper Structure (27 sections, 11 theorems, 51 equations, 5 figures, 1 table, 1 algorithm)


Key Result

Theorem 4.1

For any communicating CMDP $M$ with $S$ states, $A$ actions, and diameter $D$, under the WASP and Slater assumptions, and for $T \geq \Omega((D/\gamma)^4 S^2 A \log^2(2AT))$, the Bayesian regret of Algorithm 1 (PSConRL) is bounded for the main cost component and for each auxiliary cost component, at the order $\tilde{O}(DS\sqrt{AT})$ given in the abstract. The $O(\cdot)$ notation in the theorem hides only an absolute constant.
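
To get a feel for the horizon condition $T \geq \Omega((D/\gamma)^4 S^2 A \log^2(2AT))$, the sketch below finds the first power of two that satisfies it. The constant hidden by $\Omega$ is not given here, so `c=1.0` is a placeholder assumption, and the function names are hypothetical. The result shows how the threshold scales with $D$, $\gamma$, $S$, and $A$; it is not a usable numerical threshold.

```python
import math


def horizon_condition_holds(T, D, S, A, gamma, c=1.0):
    """Check T >= c * (D/gamma)^4 * S^2 * A * log^2(2AT).

    `c` stands in for the unknown constant hidden by Omega (ASSUMPTION).
    """
    rhs = c * (D / gamma) ** 4 * S ** 2 * A * math.log(2 * A * T) ** 2
    return T >= rhs


def first_power_of_two_horizon(D, S, A, gamma, c=1.0):
    """Return the smallest power of two T for which the condition holds."""
    T = 1
    # The left side grows linearly in T and the right side only as log^2(T),
    # so doubling T eventually satisfies the condition.
    while not horizon_condition_holds(T, D, S, A, gamma, c):
        T *= 2
    return T


if __name__ == "__main__":
    # Toy values for illustration only.
    print(first_power_of_two_horizon(D=2, S=2, A=2, gamma=1.0))
    print(first_power_of_two_horizon(D=4, S=2, A=2, gamma=1.0))
```

Because of the fourth power on $D/\gamma$, doubling the diameter or halving $\gamma$ raises the threshold by a factor of at least 16.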

Figures (5)

  • Figure 1: CMDP illustration and experimental results for the paper's counterexample, with $\theta=0.9$ and average cost threshold $\tau=0.5275$. The first panel shows the CMDP in symbolic form. The second panel shows the average cost (left) and realizations of $\tilde{\theta}$ (right). Results are averaged over 5 runs.
  • Figure 2: The main regret and constraint violation of the algorithms as a function of the horizon for Marsrover 4x4 (left column), Marsrover 8x8 (middle column), and Box (right column). The top row shows the cumulative regret of the main cost component. The bottom row shows the cumulative constraint violation. Results are averaged over 50 runs for Marsrover 4x4 and over 30 runs for Marsrover 8x8 and Box. Results for UCRL-CMDP and FHA (Alg. 3) are averaged over 10 runs for Marsrover 4x4.
  • Figure 3: Marsrover gridworlds. The initial position is light green, the goal is dark green, the walls are gray, and risky states are purple. One panel shows the 4x4 Marsrover environment and the other shows the 8x8 Marsrover environment. In both cases, the agent's task is to get from the initial state to the goal state, and the optimal policy randomizes between the fast and safe paths, which are indicated by arrows.
  • Figure 4: Box gridworld. The initial position is light green, the goal is dark green, the walls are gray, and risky states are purple. The main panel shows the initial configuration. The agent's task is to get from the initial state to the goal state, and the optimal policy randomizes between the fast and safe paths, which are indicated by arrows. The remaining panels show the safe and fast paths.
  • Figure 5: The top row shows the average reward (inverse average main cost); the dashed line shows the optimal behavior, and the dotted lines show the reward level of the safe and fast policies. The bottom row shows the average consumption of the auxiliary cost; the constraint thresholds are 0.2 for Marsrover 4x4, 0.1 for Marsrover 8x8, and 0.6 for Box. Results are averaged over 100 runs for Marsrover 4x4 and over 30 runs for Marsrover 8x8 and Box.

Theorems & Definitions (19)

  • Remark 3.1
  • Example 3.2
  • Theorem 4.1
  • Remark 4.2
  • Lemma 4.3: Posterior sampling lemma; adapted from Lemma 1 of jafarniajahromi2021online
  • Lemma 4.4: Feasibility lemma
  • Lemma 4.5: Exploration lemma
  • proof
  • Lemma 1.1
  • proof
  • ...and 9 more