Near-Constant Strong Violation and Last-Iterate Convergence for Online CMDPs via Decaying Safety Margins

Qian Zuo; Zhiyong Wang; Fengxiang He

TL;DR

This paper proposes Flexible safety Domain Optimization via Margin-regularized Exploration (FlexDOME), the first algorithm to provably achieve near-constant $\tilde{O}(1)$ strong constraint violation alongside sublinear strong regret and non-asymptotic last-iterate convergence.

Abstract

We study safe online reinforcement learning in Constrained Markov Decision Processes (CMDPs) under strong regret and violation metrics, which forbid error cancellation over time. Existing primal-dual methods that achieve sublinear strong reward regret inevitably incur growing strong constraint violation or are restricted to average-iterate convergence due to inherent oscillations. To address these limitations, we propose the Flexible safety Domain Optimization via Margin-regularized Exploration (FlexDOME) algorithm, the first to provably achieve near-constant $\tilde{O}(1)$ strong constraint violation alongside sublinear strong regret and non-asymptotic last-iterate convergence. FlexDOME incorporates time-varying safety margins and regularization terms into the primal-dual framework. Our theoretical analysis relies on a novel term-wise asymptotic dominance strategy, where the safety margin is rigorously scheduled to asymptotically majorize the functional decay rates of the optimization and statistical errors, thereby clamping cumulative violations to a near-constant level. Furthermore, we establish non-asymptotic last-iterate convergence guarantees via a policy-dual Lyapunov argument. Experiments corroborate our theoretical findings.
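To make the "no error cancellation" requirement concrete, the following sketch contrasts a signed cumulative violation with a strong one that sums only positive parts. It assumes the common convention that a per-episode violation is a real number that is positive when the constraint is violated; the function names and the sample sequence are illustrative and are not taken from the paper.

```python
def weak_violation(per_episode):
    """Signed sum: safe episodes (negative values) can cancel unsafe ones."""
    return sum(per_episode)


def strong_violation(per_episode):
    """Sum of positive parts: every unsafe episode counts, nothing cancels."""
    return sum(max(v, 0.0) for v in per_episode)


# Hypothetical learner that oscillates around the constraint boundary.
oscillating = [0.5, -0.5, 0.5, -0.5]

weak = weak_violation(oscillating)      # the signed terms cancel out
strong = strong_violation(oscillating)  # only the two unsafe episodes are summed
```

Under the signed metric the oscillating sequence looks perfectly safe, while the strong metric charges it for each unsafe episode. This is why oscillating primal-dual iterates can look acceptable on average yet still accumulate strong violation.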

Paper Structure (53 sections, 28 theorems, 164 equations, 2 figures, 1 table, 3 algorithms)

Introduction
Related Work.
Preliminaries
Notation.
Constrained Markov decision process (CMDP).
Value and objective functions.
Training protocol.
FlexDOME
The Primal-Dual Scheme in FlexDOME
Decaying Safety margin.
Time-Varying Regularizations.
Estimates
Learning Algorithm
Theoretical Analysis
Strong Regret Bounds
...and 38 more sections

Key Result

Theorem 4.1

For any confidence parameter $\delta\in(0,1)$, let $\eta_{t}=t^{-5/6}$, $\tau_{t}=t^{-1/6}$, and $\epsilon_{i,t} =18/5\sqrt{H^3C_B}\left(t^{-1/6}\cdot \log(4SAHt/\delta)^{1/4}\right)$ for any constraint $i$. Then, with probability at least $1-\delta$, Algorithm alg:main achieves the following bounds, where $T$ denotes the number of episodes, $C_B=O(m,S,A,H)$ is a $T$-independent constant and …
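The parameter schedules in Theorem 4.1 can be evaluated numerically, as in the minimal sketch below. It rests on several assumptions: the prefactor `18/5\sqrt{H^3C_B}` is read as $(18/5)\cdot\sqrt{H^3 C_B}$, the exponent $1/4$ is applied to the whole logarithm, and $C_B$ is passed in by the caller because its exact form is not given in this excerpt. The function name `flexdome_schedules` is hypothetical.

```python
import math


def flexdome_schedules(t, S, A, H, C_B, delta):
    """Return (eta_t, tau_t, eps_t) for episode t >= 1.

    Assumptions (not confirmed by the excerpt): the prefactor is
    (18/5) * sqrt(H^3 * C_B), and C_B is supplied by the caller.
    """
    if t < 1:
        raise ValueError("t must be a positive episode index")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")

    eta = t ** (-5.0 / 6.0)  # eta_t = t^{-5/6}
    tau = t ** (-1.0 / 6.0)  # tau_t = t^{-1/6}

    # The log argument 4*S*A*H*t/delta exceeds 1 for valid inputs, so the log is positive.
    log_term = math.log(4.0 * S * A * H * t / delta)
    eps = (18.0 / 5.0) * math.sqrt(H**3 * C_B) * t ** (-1.0 / 6.0) * log_term ** 0.25
    return eta, tau, eps


# Illustrative values only; S, A, H, C_B and delta are placeholders.
eta_1, tau_1, eps_1 = flexdome_schedules(1, S=2, A=2, H=2, C_B=1.0, delta=0.1)
eta_late, tau_late, eps_late = flexdome_schedules(10**6, S=2, A=2, H=2, C_B=1.0, delta=0.1)
```

All three quantities shrink with $t$: $\eta_t$ decays fastest, while $\tau_t$ and $\epsilon_{i,t}$ share the $t^{-1/6}$ rate, with $\epsilon_{i,t}$ carrying an extra slowly growing logarithmic factor.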

Figures (2)

Figure 1: Performance comparison of FlexDOME (ours) against UOpt-RPGPD and Vanilla PD baselines under both stochastic-threshold (top row) and fixed-threshold (middle row) settings. The bottom row presents an ablation study on key components of our method: the safety margin, regularization, and the stochastic threshold mechanism. All plots show the mean and standard error over 5 seeds. Performance is measured by the instantaneous optimality gap and constraint violation, alongside their corresponding strong regrets.
Figure 2: Impact of (a) exploration bonus scaler and (b) safety margin scaler on strong regret and violation. Vertical lines denote the selected baseline parameters. Results are averaged over 5 random seeds with standard error bands.

Theorems & Definitions (51)

Remark 2.1
Theorem 4.1: Strong regret bounds for reward and violation
Remark 4.2
Theorem 4.3: Last-iterate convergence
Remark 4.4
Lemma 4.5: Convergence
Remark 4.6
Lemma 4.7: Per-episode trade-off
Lemma 4.8
Lemma 4.9
...and 41 more
