Exploration-Exploitation in Constrained MDPs

Yonathan Efroni, Shie Mannor, Matteo Pirotta

TL;DR

This work addresses online learning in constrained MDPs (CMDPs) by formalizing the exploration–exploitation trade-off under long-term constraints. It develops four algorithmic families (OptCMDP, OptCMDP-bonus, OptDual-CMDP, and OptPrimalDual-CMDP) that achieve sublinear regret both for the main objective and for constraint violations, with the LP-based methods offering stronger guarantees than the dual-based ones. The paper analyzes learning in finite-horizon CMDPs using occupancy-measure LP formulations with optimistic planning, along with dual and primal–dual methods that offer computational advantages. It also discusses the weaker regret guarantees of the dual-based approaches and points to future work on tightening them and on balancing safety with scalability in RL. Overall, the results illustrate the trade-off between rigorous safety guarantees and practical computation in safe reinforcement learning.

Abstract

In many sequential decision-making problems, the goal is to optimize a utility function while satisfying a set of constraints on different utilities. This learning problem is formalized through Constrained Markov Decision Processes (CMDPs). In this paper, we investigate the exploration-exploitation dilemma in CMDPs. While learning in an unknown CMDP, an agent should trade off exploration, to discover new information about the MDP, against exploitation of the current knowledge, to maximize the reward while satisfying the constraints. While the agent will eventually learn a good or optimal policy, we do not want the agent to violate the constraints too often during the learning process. In this work, we analyze two approaches for learning in CMDPs. The first approach leverages the linear formulation of the CMDP to perform optimistic planning at each episode. The second approach leverages the dual formulation (or saddle-point formulation) of the CMDP to perform incremental, optimistic updates of the primal and dual variables. We show that both achieve sublinear regret with respect to the main utility while also having sublinear regret on the constraint violations. That being said, we highlight a crucial difference between the two approaches: the linear programming approach results in stronger guarantees than the dual-formulation-based approach.
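
To make the dual (saddle-point) idea concrete, the following is a minimal sketch of projected dual ascent on a toy one-step constrained problem with a known model. It is not the paper's OptDual-CMDP or OptPrimalDual-CMDP: there is no model estimation and no optimism, and the function name `dual_ascent`, the reward and cost values, the threshold, and the step size are all hypothetical choices made for illustration.

```python
def dual_ascent(rewards, costs, threshold, step=0.1, rounds=2000):
    """Toy Lagrangian scheme for: maximize reward subject to expected cost <= threshold.

    Assumption: rewards and costs are known, so there is no exploration here.
    """
    lam = 0.0            # Lagrange multiplier for the single cost constraint
    total_reward = 0.0
    total_cost = 0.0
    for _ in range(rounds):
        # Primal step: best response to the current multiplier,
        # i.e. maximize the Lagrangian r(a) - lam * (c(a) - threshold) over actions.
        action = max(range(len(rewards)),
                     key=lambda a: rewards[a] - lam * costs[a])
        total_reward += rewards[action]
        total_cost += costs[action]
        # Dual step: projected subgradient update, keeping lam nonnegative.
        lam = max(0.0, lam + step * (costs[action] - threshold))
    return total_reward / rounds, total_cost / rounds, lam


# Hypothetical instance: action 0 is rewarding but costly, action 1 is safe.
avg_reward, avg_cost, final_lam = dual_ascent(
    rewards=[1.0, 0.2], costs=[1.0, 0.0], threshold=0.5)
print(avg_reward, avg_cost, final_lam)
```

In this toy run every single round plays a deterministic action, so rounds that pick action 0 overshoot the threshold on their own; only the running average of the cost approaches the threshold. That is a property of this sketch, and it is offered only as intuition for why per-episode and averaged guarantees can differ.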

Paper Structure

This paper contains 47 sections, 48 theorems, 204 equations, 1 table, 5 algorithms.

Key Result

Proposition 1

The set $\Delta^\mu(\mathcal{M})$ of occupancy measures is convex.
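
The convexity claim can be checked numerically on a small example. The sketch below assumes the standard characterization of finite-horizon occupancy measures by linear constraints (nonnegativity, an initial marginal equal to $\mu$, and flow conservation between consecutive steps); whether this matches the paper's exact definition of $\Delta^\mu(\mathcal{M})$ is an assumption. The sizes `S`, `A`, `H`, the distribution `MU`, the kernel `P`, and all function names are hypothetical.

```python
from itertools import product

S, A, H = 2, 2, 3                      # hypothetical numbers of states, actions, steps
MU = [0.6, 0.4]                        # hypothetical initial state distribution
# P[s][a][s2]: probability of moving from s to s2 under action a (stationary here)
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]


def occupancy(policy):
    """Return q with q[h][s][a] = Pr(s_h = s, a_h = a); policy[h][s][a] is an action probability."""
    q, state_dist = [], MU[:]
    for h in range(H):
        q_h = [[state_dist[s] * policy[h][s][a] for a in range(A)] for s in range(S)]
        q.append(q_h)
        state_dist = [sum(q_h[s][a] * P[s][a][s2] for s in range(S) for a in range(A))
                      for s2 in range(S)]
    return q


def is_occupancy_measure(q, tol=1e-9):
    """Check nonnegativity, the initial marginal, and flow conservation."""
    if any(q[h][s][a] < -tol for h, s, a in product(range(H), range(S), range(A))):
        return False
    if any(abs(sum(q[0][s]) - MU[s]) > tol for s in range(S)):
        return False
    for h in range(1, H):
        for s2 in range(S):
            inflow = sum(q[h - 1][s][a] * P[s][a][s2] for s in range(S) for a in range(A))
            if abs(sum(q[h][s2]) - inflow) > tol:
                return False
    return True


def mix(q1, q2, w):
    """Convex combination w * q1 + (1 - w) * q2, entry by entry."""
    return [[[w * q1[h][s][a] + (1 - w) * q2[h][s][a] for a in range(A)]
             for s in range(S)] for h in range(H)]


def induced_policy(q):
    """Recover pi_h(a|s) = q_h(s,a) / sum_a q_h(s,a); uniform where the state is unreachable."""
    return [[[q[h][s][a] / sum(q[h][s]) if sum(q[h][s]) > 0 else 1.0 / A
              for a in range(A)] for s in range(S)] for h in range(H)]


pi_a = [[[1.0, 0.0] for _ in range(S)] for _ in range(H)]   # always action 0
pi_b = [[[0.0, 1.0] for _ in range(S)] for _ in range(H)]   # always action 1
q_mix = mix(occupancy(pi_a), occupancy(pi_b), 0.3)
print(is_occupancy_measure(q_mix))
```

Because every constraint is linear in `q`, any convex combination of two feasible points remains feasible, and `induced_policy(q_mix)` recovers a (generally stochastic) policy whose occupancy measure is the mixture itself.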

Theorems & Definitions (82)

  • proof
  • Remark 1
  • Proposition 1
  • Proposition 2
  • proof
  • Theorem 2: Regret Bounds for
  • Theorem 2: Regret Bounds for
  • Remark 2
  • Theorem 2: Regret Bounds for
  • Theorem 2: Regret Bounds for
  • ...and 72 more