
Best-of-Both-Worlds Policy Optimization for CMDPs with Bandit Feedback

Francesco Emanuele Stradi, Anna Lunghi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti

TL;DR

This work delivers the first best-of-both-worlds algorithm for online CMDPs with bandit feedback, enabling robust learning under both stochastic and adversarial costs without full-information feedback or occupancy-measure optimization. It hinges on a primal-dual policy optimization framework that couples a no-interval-regret primal updater with online dual updates, yielding $\tilde{O}(\sqrt{T})$ regret and constraint violations in the stochastic case and sublinear constraint violations with a no-$\alpha$-regret guarantee in the adversarial case, while automatically bounding the Lagrange multipliers. Importantly, the method avoids convex optimization over occupancy measures, improving computational efficiency and practicality. Theoretical results cover stochastic and adversarial constraint regimes, with and without Slater-type feasibility ($\rho$), and extend to a weaker baseline that preserves optimal stochastic-rate guarantees. Overall, the approach advances reliable, scalable CMDP optimization under bandit feedback, broadening applicability to real-world sequential decision problems with constraints.

Abstract

We study online learning in constrained Markov decision processes (CMDPs) in which rewards and constraints may be either stochastic or adversarial. In such settings, Stradi et al. (2024) proposed the first best-of-both-worlds algorithm able to seamlessly handle stochastic and adversarial constraints, achieving optimal regret and constraint violation bounds in both cases. This algorithm suffers from two major drawbacks. First, it only works under full feedback, which severely limits its applicability in practice. Second, it relies on optimizing over the space of occupancy measures, which requires solving convex optimization problems, a highly inefficient task. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Specifically, when the constraints are stochastic, the algorithm achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation, while, when they are adversarial, it attains $\widetilde{\mathcal{O}}(\sqrt{T})$ constraint violation and a tight fraction of the optimal reward. Moreover, our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.
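
To make the primal-dual idea concrete, the following is a minimal illustrative sketch and not the paper's algorithm. It assumes a single-state problem (a constrained multi-armed bandit) with one constraint, Bernoulli rewards and costs, and it swaps in an Exp3-style exponential-weights learner for the paper's no-interval-regret policy optimizer. The function name `primal_dual_bandit` and all parameters (`eta`, `dual_step`, `lam_max`, `explore`) are hypothetical choices made for this example. In particular, the multiplier is clipped to `lam_max` by hand here, whereas the paper states that its multipliers are bounded automatically.

```python
import math
import random


def primal_dual_bandit(reward_means, cost_means, threshold, rounds=2000,
                       eta=0.05, dual_step=0.05, lam_max=5.0,
                       explore=0.1, seed=0):
    """Toy primal-dual loop with bandit feedback (illustrative assumption).

    Primal step: Exp3-style update on an importance-weighted Lagrangian loss.
    Dual step: projected gradient ascent on the multiplier `lam`.
    """
    rng = random.Random(seed)
    k = len(reward_means)
    log_weights = [0.0] * k
    lam = 0.0
    policy = [1.0 / k] * k
    for _ in range(rounds):
        # Softmax over log-weights, mixed with uniform exploration so that
        # every action keeps probability at least explore / k.
        top = max(log_weights)
        exps = [math.exp(w - top) for w in log_weights]
        total = sum(exps)
        policy = [(1.0 - explore) * e / total + explore / k for e in exps]

        # Bandit feedback: only the chosen action's reward and cost are seen.
        action = rng.choices(range(k), weights=policy)[0]
        reward = 1.0 if rng.random() < reward_means[action] else 0.0
        cost = 1.0 if rng.random() < cost_means[action] else 0.0

        # Lagrangian payoff, rescaled to a loss in [0, 1].
        payoff = reward - lam * (cost - threshold)
        loss = (1.0 + lam_max * threshold - payoff) / (1.0 + lam_max)

        # Primal update with an importance-weighted loss estimate.
        log_weights[action] -= eta * loss / policy[action]

        # Dual update: raise lam when the constraint is violated,
        # lower it otherwise, and project onto [0, lam_max].
        lam = min(lam_max, max(0.0, lam + dual_step * (cost - threshold)))
    return policy, lam


if __name__ == "__main__":
    # Action 0 has the highest reward but also the highest expected cost.
    final_policy, final_lam = primal_dual_bandit(
        reward_means=[0.9, 0.6, 0.3],
        cost_means=[0.8, 0.3, 0.1],
        threshold=0.4,
    )
    print(final_policy, final_lam)
```

The two updates use only the reward and cost of the action actually played, which is the bandit-feedback restriction, and neither step solves a convex program over occupancy measures.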

Paper Structure

This paper contains 34 sections, 30 theorems, 143 equations, 4 algorithms.

Key Result

Theorem 3

For any $\delta\in(0,1)$, with probability at least $1-8\delta$, Algorithm FS-PODB attains: where the regret can be computed with respect to any policy function $\pi:X \rightarrow \Delta(A)$.

Theorems & Definitions (54)

  • Definition 1: Lagrangian function
  • Definition 2
  • Definition 3
  • Definition 4: Interval regret
  • Definition 5: No-interval regret property
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • ...and 44 more