Efficient Exploration in Average-Reward Constrained Reinforcement Learning: Achieving Near-Optimal Regret With Posterior Sampling
Danil Provodin, Maurits Kaptein, Mykola Pechenizkiy
TL;DR
The paper addresses constrained reinforcement learning in infinite-horizon average-reward CMDPs with unknown transitions. It introduces PSConRL, a posterior-sampling algorithm that plans in CMDPs sampled from the posterior, computes policies from occupancy-measure linear programs (LPs), and switches to an exploration strategy when feasibility is compromised. The main contributions are a Bayesian regret bound of $\tilde{O}(DS\sqrt{AT})$ for each cost component, a computationally tractable algorithm for communicating CMDPs, and empirical results in which PSConRL outperforms existing methods. The work advances practical safe RL by combining near-optimal regret guarantees under average-reward constraints with scalable computation.
Abstract
We present a new algorithm based on posterior sampling for learning in Constrained Markov Decision Processes (CMDPs) in the infinite-horizon undiscounted setting. The algorithm achieves near-optimal regret bounds and compares favorably with existing algorithms in experiments. Our main theoretical result is a Bayesian regret bound of $\tilde{O}(DS\sqrt{AT})$ for each cost component, valid for any communicating CMDP with $S$ states, $A$ actions, and diameter $D$. This bound matches the lower bound in its dependence on the time horizon $T$ and is the best-known regret bound for communicating CMDPs achieved by a computationally tractable algorithm. Empirical results show that our posterior sampling algorithm outperforms existing algorithms for constrained reinforcement learning.
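
To make the two ingredients named above more concrete (a CMDP sampled from the posterior, and a policy obtained from an occupancy-measure LP), the following is a minimal sketch and not the authors' PSConRL implementation. It assumes independent Dirichlet posteriors over transitions, known rewards and costs, a single cost constraint, and a reward-maximization objective; the function names `sample_transitions` and `solve_occupancy_lp`, the `prior` parameter, and the `threshold` argument are hypothetical. The exploration step the algorithm uses when feasibility is compromised is not reproduced here; the sketch simply returns `None` in that case.

```python
import numpy as np
from scipy.optimize import linprog


def sample_transitions(counts, rng, prior=1.0):
    """Draw one transition kernel from independent Dirichlet posteriors.

    counts[s, a, s2] is the number of observed transitions (s, a) -> s2.
    The symmetric Dirichlet prior is an assumption of this sketch.
    """
    S, A, _ = counts.shape
    P = np.empty((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(counts[s, a] + prior)
    return P


def solve_occupancy_lp(P, reward, cost, threshold):
    """Solve the average-reward occupancy-measure LP for one sampled CMDP.

    Variables are mu(s, a) >= 0, flattened in row-major order.
    Returns (mu, policy), or None if the LP is infeasible.
    """
    S, A, _ = P.shape
    n = S * A

    # Flow conservation for each state s2:
    #   sum_a mu(s2, a) - sum_{s, a} P[s, a, s2] * mu(s, a) = 0
    A_eq = np.zeros((S + 1, n))
    for s2 in range(S):
        A_eq[s2] = -P[:, :, s2].reshape(n)
        A_eq[s2, s2 * A:(s2 + 1) * A] += 1.0
    # Normalization: mu is a probability distribution over (s, a).
    A_eq[S] = 1.0
    b_eq = np.zeros(S + 1)
    b_eq[S] = 1.0

    res = linprog(
        c=-reward.reshape(n),           # linprog minimizes, so negate reward
        A_ub=cost.reshape(1, n),        # average cost <= threshold
        b_ub=[threshold],
        A_eq=A_eq,
        b_eq=b_eq,
        bounds=(0, None),
        method="highs",
    )
    if not res.success:
        return None  # infeasible sampled CMDP; exploration step not shown

    mu = np.clip(res.x, 0.0, None).reshape(S, A)
    state_mass = mu.sum(axis=1, keepdims=True)
    # Stationary policy pi(a | s) = mu(s, a) / sum_a mu(s, a);
    # states with no mass get a uniform policy.
    policy = np.where(
        state_mass > 1e-12, mu / np.maximum(state_mass, 1e-12), 1.0 / A
    )
    return mu, policy
```

In a posterior-sampling loop, one would call `sample_transitions` at the start of each episode with the current counts, pass the result to `solve_occupancy_lp`, and follow the returned policy until the next resampling point. How episodes are scheduled is left out of this sketch.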
