Optimal Strong Regret and Violation in Constrained MDPs via Policy Optimization

Francesco Emanuele Stradi; Matteo Castiglioni; Alberto Marchesi; Nicola Gatti

TL;DR

A primal-dual scheme that uses a state-of-the-art policy optimization approach for adversarial MDPs as the primal algorithm and a UCB-like update for the dual variables, yielding an efficient policy optimization algorithm with $\widetilde{\mathcal{O}}(\sqrt{T})$ strong regret/violation.

Abstract

We study online learning in constrained MDPs (CMDPs), focusing on the goal of attaining sublinear strong regret and strong cumulative constraint violation. Unlike their standard (weak) counterparts, these metrics do not allow negative terms to compensate for positive ones, which raises considerable additional challenges. Efroni et al. (2020) were the first to propose an algorithm with sublinear strong regret and strong violation, by exploiting linear programming. Since it relies on linear programming, their algorithm is highly inefficient, leaving open the problem of achieving sublinear bounds with policy optimization methods, which are much more efficient in practice. Very recently, Muller et al. (2024) partially addressed this problem by proposing a policy optimization method that attains $\widetilde{\mathcal{O}}(T^{0.93})$ strong regret/violation. This still leaves open the question of whether optimal bounds are achievable with an approach of this kind. We answer this question affirmatively, by providing an efficient policy optimization algorithm with $\widetilde{\mathcal{O}}(\sqrt{T})$ strong regret/violation. Our algorithm implements a primal-dual scheme that employs a state-of-the-art policy optimization approach for adversarial (unconstrained) MDPs as the primal algorithm, and a UCB-like update for the dual variables.
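The difference between weak and strong metrics can be illustrated with a small sketch. It assumes that the strong metric sums the positive part of each per-episode term, which is how the abstract's "negative terms cannot compensate for positive ones" is read here. The function names and the sample values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def weak_cumulative(terms: Sequence[float]) -> float:
    """Plain sum: negative terms can cancel positive ones."""
    return sum(terms)


def strong_cumulative(terms: Sequence[float]) -> float:
    """Sum of positive parts: each term is clipped at zero before summing."""
    return sum(max(t, 0.0) for t in terms)


# Hypothetical per-episode constraint violations (positive = constraint violated).
per_episode_violation = [0.5, -0.5, 0.25, -0.25]

weak = weak_cumulative(per_episode_violation)
strong = strong_cumulative(per_episode_violation)
print(weak, strong)
```

In this toy sequence the violations cancel under the weak metric, while the strong metric still counts every episode in which the constraint was violated. This is why a learner that oscillates between overly safe and unsafe policies can look fine under the weak metric but not under the strong one.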

Paper Structure (31 sections, 23 theorems, 87 equations, 2 algorithms)

Introduction
Related Works
Preliminaries
Constrained Markov Decision Processes
Offline Optimization in CMDPs
Online Learning in Episodic CMDPs
Parameters Estimation
Compact notation
A Novel Primal-Dual Algorithm
The CPD-PO Algorithm
Algorithm Comparison with Efroni et al. (2020) and Muller et al. (2024)
Theoretical Analysis
Results on the Lagrangian Formulation
Primal Algorithm
Regret and Violation
...and 16 more sections

Key Result

Lemma 1

Given a confidence parameter $\delta\in(0,1)$, with probability at least $1-\delta$, the following holds for every episode $t\in[T]$ and state-action pair $(x,a)\in X \times A$:

Theorems & Definitions (37)

Definition 1: Lagrangian function
Lemma 1
Lemma 2
Lemma 3
Lemma 4
Lemma 5
Theorem 1
Theorem 2
Lemma 6
proof
...and 27 more
