Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms

Vaneet Aggarwal, Washim Uddin Mondal, Qinbo Bai

TL;DR

This work studies constrained Markov decision processes (CMDPs) under average-reward objectives, developing both model-based and model-free algorithms with rigorous performance guarantees. It presents two model-based CMDP methods, one optimism-based and one based on posterior sampling, that achieve sublinear regret with zero constraint violations, and a parametrized model-free approach that uses a primal-dual policy gradient with provable convergence and regret bounds. The main results assume an ergodic MDP; extensions to weakly communicating (WC) MDPs are also explored, with sublinear regret guarantees in the WC case and open questions identified for tighter bounds and parameter-free methods. Across theory and experiments (a queueing flow-control example), the framework demonstrates practical viability for long-run constrained decision-making in unknown environments, bridging online learning guarantees with CMDP constraints.

Abstract

Reinforcement Learning (RL) serves as a versatile framework for sequential decision-making, finding applications across diverse domains such as robotics, autonomous driving, recommendation systems, supply chain optimization, biology, mechanics, and finance. The primary objective in these applications is to maximize the average reward. Real-world scenarios often necessitate adherence to specific constraints during the learning process. This monograph focuses on the exploration of various model-based and model-free approaches for Constrained RL within the context of average reward Markov Decision Processes (MDPs). The investigation commences with an examination of model-based strategies, delving into two foundational methods - optimism in the face of uncertainty and posterior sampling. Subsequently, the discussion transitions to parametrized model-free approaches, where the primal-dual policy gradient-based algorithm is explored as a solution for constrained MDPs. The monograph provides regret guarantees and analyzes constraint violation for each of the discussed setups. For the above exploration, we assume the underlying MDP to be ergodic. Further, this monograph extends its discussion to encompass results tailored for weakly communicating MDPs, thereby broadening the scope of its findings and their relevance to a wider range of practical scenarios.

Paper Structure

This paper contains 35 sections, 47 theorems, 221 equations, 1 figure, 3 tables, 5 algorithms.

Key Result

Lemma 1.1

Let $f: \mathbb{R} \rightarrow \mathbb{R}$ be a convex function, and let $X$ be a random variable. If $E[X]$ is finite, then

$$f(E[X]) \leq E[f(X)].$$
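
Lemma 1.1 is Jensen's inequality, which for a convex $f$ states that $f(E[X]) \leq E[f(X)]$. The following is a minimal numerical sketch of that statement, not code from the monograph: the helper `jensen_gap`, the Gaussian samples, and the chosen test functions are all illustrative assumptions. It evaluates the gap $E[f(X)] - f(E[X])$ under the empirical distribution of a sample, where the inequality holds exactly up to floating-point error.

```python
import random


def jensen_gap(f, samples):
    """Return E[f(X)] - f(E[X]) under the empirical distribution of `samples`.

    For a convex f, Jensen's inequality says this gap is non-negative.
    (Hypothetical helper, introduced only for illustration.)
    """
    n = len(samples)
    mean_x = sum(samples) / n                   # empirical E[X]
    mean_fx = sum(f(x) for x in samples) / n    # empirical E[f(X)]
    return mean_fx - f(mean_x)


random.seed(0)
# Assumed test distribution: standard normal samples.
samples = [random.gauss(0.0, 1.0) for _ in range(10_000)]

gap_square = jensen_gap(lambda x: x * x, samples)        # equals the empirical variance
gap_abs = jensen_gap(abs, samples)                       # |x| is convex
gap_affine = jensen_gap(lambda x: 2.0 * x + 1.0, samples)  # affine f: equality case

print(gap_square >= 0, gap_abs >= 0, abs(gap_affine) < 1e-9)
```

For $f(x) = x^2$ the gap is the empirical variance, so the inequality reduces to the familiar fact that a variance cannot be negative; for an affine $f$ the two sides coincide.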

Figures (1)

  • Figure 1: Performance of the proposed C-UCRL and C-PSRL algorithms on a flow and service control problem for a single queue with doubling epoch lengths and linearly increasing epoch lengths. The algorithms are compared against chen2022learning and singh2020learning. We note that the considered algorithms C-UCRL and C-PSRL are labeled UC-CURL and PS-CURL, respectively, in the figure.

Theorems & Definitions (84)

  • Lemma 1.1: Jensen's Inequality
  • Lemma 1.2: Cauchy-Schwarz Inequality (dragomir2003survey)
  • Lemma 1.3
  • Lemma 1.4: Azuma-Hoeffding's Inequality (serfling1974probability)
  • Lemma 1.5: Any-interval Azuma's inequality (chen2022learning)
  • Lemma 1.6
  • Lemma 1.7
  • Lemma 1.8
  • Lemma 1.9
  • Remark 2.1
  • ...and 74 more