Anytime-Competitive Reinforcement Learning with Policy Prior

Jianyi Yang; Pengfei Li; Tongxin Li; Adam Wierman; Shaolei Ren

TL;DR

This work introduces Anytime-Competitive MDPs (A-CMDP) in which an agent must guarantee per-round cost constraints relative to a policy prior for every round within any episode, while maximizing expected reward. It adopts a safe-action projection (ACD) to enforce constraints and then develops ACRL, a model-based RL algorithm that learns the transition dynamics and optimizes under the anytime constraints. Theoretical results characterize an intrinsic reward gap due to constraint satisfaction and provide sublinear pseudo-regret bounds for ACRL, with explicit dependence on Lipschitz constants, the prior’s telescoping properties, and the constraint parameters $\lambda$ and $b$; in linear-transition settings, the bounds become more favorable. Empirical evaluations on carbon-aware computing tasks demonstrate that ACRL satisfies the anytime constraints with zero violation and achieves competitive reward performance relative to unconstrained or expected-constraint baselines, highlighting practical relevance for mission-critical systems.

Abstract

This paper studies the problem of Anytime-Competitive Markov Decision Process (A-CMDP). Existing works on Constrained Markov Decision Processes (CMDPs) aim to optimize the expected reward while constraining the expected cost over random dynamics, but the cost in a specific episode can still be unsatisfactorily high. In contrast, the goal of A-CMDP is to optimize the expected reward while guaranteeing a bounded cost in each round of any episode against a policy prior. We propose a new algorithm, called Anytime-Competitive Reinforcement Learning (ACRL), which provably guarantees the anytime cost constraints. The regret analysis shows the policy asymptotically matches the optimal reward achievable under the anytime competitive constraints. Experiments on the application of carbon-intelligent computing verify the reward performance and cost constraint guarantee of ACRL.

Paper Structure (28 sections, 9 theorems, 58 equations, 5 figures, 1 table, 2 algorithms)

Introduction
Related Work
Problem Formulation
Anytime-Competitive MDP
Motivating Examples
Methods
Guarantee the Anytime Constraints
Anytime-Competitive RL
Performance Analysis
Regret due to Constraint Guarantee
Regret of ACRL
Empirical Results
Concluding Remarks
Empirical Results - Carbon-Aware Resource Management
Problem Formulation
...and 13 more sections

Key Result

Proposition 4.1

Suppose that Assumptions asp:Lips and asp:telescoping are satisfied. At round $h$, with costs $\{c_i\}_{i=1}^{h-1}$ observed, the anytime competitive constraints $J_{h'}^{\pi} \leq (1+\lambda)J_{h'}^{\pi^{\dagger}}+h'b$ for rounds $h'=h,\cdots,H$ are satisfied if, for all subsequent rounds $h'=h,\cdots$ … where $\Gamma_{j,n}=\sum_{i=n}^{H}q_{j,i}$ $(j\in [H], \forall n\geq j)$, with $q_{j,i} = L_c$ …
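
To make the constraint in the proposition concrete, the following is a minimal Python sketch that checks $J_{h} \leq (1+\lambda)J_{h}^{\pi^{\dagger}}+hb$ at every round $h$ of one episode. It assumes that $J_h$ denotes the cumulative cost over rounds $1,\dots,h$, which this summary does not state explicitly. The function names `anytime_constraints_hold` and `gamma` and the numeric inputs are hypothetical. The helper `gamma` only evaluates the suffix sum $\Gamma_{j,n}=\sum_{i=n}^{H}q_{j,i}$ for caller-supplied $q_{j,i}$ values, since the definition of $q_{j,i}$ is truncated above.

```python
from itertools import accumulate
from typing import Sequence


def anytime_constraints_hold(
    costs: Sequence[float],
    prior_costs: Sequence[float],
    lam: float,
    b: float,
) -> bool:
    """Check J_h <= (1 + lam) * J_h_prior + h * b for every round h = 1..H.

    Assumption: J_h is the cumulative cost over rounds 1..h.
    """
    if len(costs) != len(prior_costs):
        raise ValueError("both cost sequences must cover the same rounds")
    cumulative = accumulate(costs)              # J_h for the learned policy
    cumulative_prior = accumulate(prior_costs)  # J_h for the policy prior
    return all(
        j <= (1 + lam) * j_prior + h * b
        for h, (j, j_prior) in enumerate(zip(cumulative, cumulative_prior), start=1)
    )


def gamma(q_row: Sequence[float], n: int) -> float:
    """Gamma_{j,n} = sum_{i=n}^{H} q_{j,i}, where q_row[i - 1] holds q_{j,i}."""
    return sum(q_row[n - 1:])


# Hypothetical per-round costs for one episode of length H = 3.
ok = anytime_constraints_hold([1, 2, 3], [1, 1, 1], lam=0.5, b=0.5)
front_loaded = anytime_constraints_hold([3, 0, 0], [1, 1, 1], lam=0.5, b=0.5)
```

In the second call the episode-level total (3) is within the final bound (6), but the first round already exceeds its bound (3 > 2). A check on the end-of-episode cost alone would miss this violation, which the per-round check catches.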

Figures (5)

Figure 1: Regret and cost violation rate of different algorithms. Shadows in Figure \ref{fig:regret_lambda} show the range of the regret.
Figure 2: Regret and cost violation rate of different algorithms. Figure \ref{fig:regret_episode} gives the regret as it changes over episodes. Figure \ref{fig:regret_lambda} shows the regret with different $\lambda$ and $b$ after exploration for all the $4000$ episodes. Shadows in Figure \ref{fig:regret_lambda} show the range of regret. Figure \ref{fig:violation_rate} shows the probability of the violation of the anytime competitive constraints.
Figure 3: QoS costs of different algorithms. Figure \ref{fig:cost_worst} and Figure \ref{fig:cost_avg} give the worst-case costs and the average cost for different algorithms under $b=2$, $\lambda=2$ and $\lambda=6$, respectively. For OPT-QoS$\pi^{\dagger}$, the worst-case cost $\max (J_H^{\pi^{\dagger}})$ and average cost $\mathbb{E}(J_H^{\pi^{\dagger}})$ are 34.95 and 27.92, respectively.
Figure 4: Regrets of different algorithms. The regret of Carbon-B is 51.417, which is outside the range of the y-axis.
Figure 5: Illustration of state perturbation

Theorems & Definitions (18)

Definition 3.1: Anytime competitive constraints
Definition 3.3: Telescoping policy
Proposition 4.1
Corollary 4.2
Theorem 5.1
Theorem 5.2
proof
proof
Lemma D.1
proof
...and 8 more
