Near-Constant Strong Violation and Last-Iterate Convergence for Online CMDPs via Decaying Safety Margins

Qian Zuo, Zhiyong Wang, Fengxiang He

TL;DR

FlexDOME (Flexible safety Domain Optimization via Margin-regularized Exploration) is the first algorithm to provably achieve near-constant $\tilde{O}(1)$ strong constraint violation alongside sublinear strong regret and non-asymptotic last-iterate convergence in online CMDPs.

Abstract

We study safe online reinforcement learning in Constrained Markov Decision Processes (CMDPs) under strong regret and violation metrics, which forbid error cancellation over time. Existing primal-dual methods that achieve sublinear strong reward regret inevitably incur growing strong constraint violation or are restricted to average-iterate convergence due to inherent oscillations. To address these limitations, we propose the Flexible safety Domain Optimization via Margin-regularized Exploration (FlexDOME) algorithm, the first to provably achieve near-constant $\tilde{O}(1)$ strong constraint violation alongside sublinear strong regret and non-asymptotic last-iterate convergence. FlexDOME incorporates time-varying safety margins and regularization terms into the primal-dual framework. Our theoretical analysis relies on a novel term-wise asymptotic dominance strategy, where the safety margin is rigorously scheduled to asymptotically majorize the functional decay rates of the optimization and statistical errors, thereby clamping cumulative violations to a near-constant level. Furthermore, we establish non-asymptotic last-iterate convergence guarantees via a policy-dual Lyapunov argument. Experiments corroborate our theoretical findings.
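
To make concrete what "forbid error cancellation over time" means, the following minimal sketch contrasts a weak and a strong cumulative violation. It assumes the common convention that the strong metric sums the positive part of each per-episode constraint gap, while the weak metric sums the signed gaps before clipping. The function names and the example gaps are hypothetical and do not come from the paper.

```python
from typing import Sequence


def weak_violation(gaps: Sequence[float]) -> float:
    """Signed gaps are summed first, so safe episodes can offset unsafe ones."""
    return max(sum(gaps), 0.0)


def strong_violation(gaps: Sequence[float]) -> float:
    """Each gap is clipped at zero before summing, so nothing cancels."""
    return sum(max(g, 0.0) for g in gaps)


# Hypothetical per-episode constraint gaps (positive means the constraint
# was violated in that episode). The policy oscillates around the boundary.
example_gaps = [0.5, -0.5, 0.5, -0.5]

weak = weak_violation(example_gaps)      # oscillations cancel out
strong = strong_violation(example_gaps)  # every violating episode counts
print(f"weak violation:   {weak}")
print(f"strong violation: {strong}")
```

Under this convention an oscillating policy can look safe in the weak sense while still accumulating strong violation, which is why average-iterate guarantees do not carry over to the strong metric.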

Paper Structure

This paper contains 53 sections, 28 theorems, 164 equations, 2 figures, 1 table, and 3 algorithms.

Key Result

Theorem 4.1

For any confidence parameter $\delta\in(0,1)$, let $\eta_{t}=t^{-5/6}$, $\tau_{t}=t^{-1/6}$, and $\epsilon_{i,t} =18/5\sqrt{H^3C_B}\left(t^{-1/6}\cdot \log(4SAHt/\delta)^{1/4}\right)$ for any constraint $i$. Then, with probability at least $1-\delta$, the main algorithm (alg:main) achieves the following bounds, where $T$ denotes the number of episodes, $C_B=O(m,S,A,H)$ is a $T$-independent constant, and …
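
The parameter schedules stated in Theorem 4.1 can be evaluated numerically, as in the minimal sketch below. It makes several assumptions. The prefactor `18/5\sqrt{H^3C_B}` is read as $(18/5)\cdot\sqrt{H^3 C_B}$. The constant $C_B$ is not given in closed form here, so the value passed in is a placeholder. The descriptive roles in the comments ($\eta_t$ as a step size, $\tau_t$ as a regularization weight, $\epsilon_{i,t}$ as the safety margin) are inferred from the abstract rather than stated by the theorem. The function name and the problem sizes are hypothetical.

```python
import math
from typing import Tuple


def theorem_4_1_schedules(
    t: int, S: int, A: int, H: int, C_B: float, delta: float
) -> Tuple[float, float, float]:
    """Return (eta_t, tau_t, epsilon_t) for episode t >= 1.

    C_B is a T-independent constant whose exact form is not reproduced
    here; callers must supply it (placeholder values below).
    """
    if t < 1 or not (0.0 < delta < 1.0):
        raise ValueError("need t >= 1 and delta in (0, 1)")
    eta_t = t ** (-5.0 / 6.0)   # assumed role: step size
    tau_t = t ** (-1.0 / 6.0)   # assumed role: regularization weight
    # Assumed reading of the prefactor: (18/5) * sqrt(H^3 * C_B).
    log_term = math.log(4.0 * S * A * H * t / delta) ** 0.25
    eps_t = (18.0 / 5.0) * math.sqrt(H**3 * C_B) * tau_t * log_term
    return eta_t, tau_t, eps_t


# Hypothetical problem sizes and a placeholder C_B = 1.0.
for t in (1, 10, 100, 1000):
    eta, tau, eps = theorem_4_1_schedules(t, S=5, A=3, H=10, C_B=1.0, delta=0.1)
    print(f"t={t:5d}  eta={eta:.4f}  tau={tau:.4f}  eps={eps:.4f}")
```

The margin shrinks at roughly the $t^{-1/6}$ rate, up to the slowly growing logarithmic factor, which matches the "decaying safety margin" in the title.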

Figures (2)

  • Figure 1: Performance comparison of FlexDOME (ours) against UOpt-RPGPD and Vanilla PD baselines under both stochastic-threshold (top row) and fixed-threshold (middle row) settings. The bottom row presents an ablation study on key components of our method: the safety margin, regularization, and the stochastic threshold mechanism. All plots show the mean and standard error over 5 seeds. Performance is measured by the instantaneous optimality gap and constraint violation, alongside their corresponding strong regrets.
  • Figure 2: Impact of (a) exploration bonus scaler and (b) safety margin scaler on strong regret and violation. Vertical lines denote the selected baseline parameters. Results are averaged over 5 random seeds with standard error bands.

Theorems & Definitions (51)

  • Remark 2.1
  • Theorem 4.1: Strong regret bounds for reward and violation
  • Remark 4.2
  • Theorem 4.3: Last-iterate convergence
  • Remark 4.4
  • Lemma 4.5: Convergence
  • Remark 4.6
  • Lemma 4.7: Per-episode trade-off
  • Lemma 4.8
  • Lemma 4.9
  • ...and 41 more