Provably Efficient RL under Episode-Wise Safety in Constrained MDPs with Linear Function Approximation

Toshinori Kitamura, Arnob Ghosh, Tadashi Kozuno, Wataru Kumagai, Kazumi Kasaura, Kenta Hoshino, Yohei Hosoe, Yutaka Matsuo

TL;DR

The paper tackles episodic RL in constrained MDPs (CMDPs) with linear function approximation by introducing OPSE-LCMDP, an algorithmic framework that combines optimistic reward estimates with pessimistic safety estimates via a composite softmax policy. A novel deployment rule for a strictly safe policy ensures zero episode-wise constraint violations, while a bisection search over a parameter λ balances exploration and safety, yielding $\tilde{\mathcal{O}}(\sqrt{K})$ regret with safety guarantees and polynomial-time computation. Theoretical results show sublinear regret and zero violation in linear CMDPs, and experiments corroborate practical efficiency and safety advantages over prior linear CMDP approaches and tabular baselines. Overall, the work enables scalable, provably safe RL in large CMDPs with linear structure, addressing a key gap between safety and function approximation. The methods have potential impact on real-world safety-critical decision-making, where large state spaces preclude LP-based or exhaustive tabular approaches.
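To make the bisection idea concrete, the following is a minimal sketch on a one-step toy problem, not the paper's OPSE-LCMDP procedure. It assumes a softmax policy over the composite score "optimistic reward plus λ times pessimistic utility" and a constraint of the form "expected utility at least a threshold"; the function names, the temperature, the search range `lam_max`, and the toy numbers are all hypothetical. Bisection is valid here because the expected utility under this softmax is nondecreasing in λ.

```python
import numpy as np


def softmax_policy(reward_opt, utility_pess, lam, temperature=1.0):
    """Softmax over the composite score reward_opt + lam * utility_pess."""
    scores = (reward_opt + lam * utility_pess) / temperature
    scores = scores - scores.max()  # stabilize the exponentials
    weights = np.exp(scores)
    return weights / weights.sum()


def bisect_lambda(reward_opt, utility_pess, threshold, lam_max=100.0, iters=50):
    """Approximate the smallest lam in [0, lam_max] whose softmax policy
    satisfies the pessimistic constraint E[utility_pess] >= threshold."""

    def is_safe(lam):
        policy = softmax_policy(reward_opt, utility_pess, lam)
        return float(policy @ utility_pess) >= threshold

    if is_safe(0.0):
        return 0.0  # the reward-only policy is already safe
    if not is_safe(lam_max):
        return None  # assumption: the caller falls back to a known safe policy
    lo, hi = 0.0, lam_max  # invariant: lo is unsafe, hi is safe
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if is_safe(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical toy values: three actions, reward and utility in tension.
reward_opt = np.array([1.0, 0.5, 0.1])    # optimistic reward estimates
utility_pess = np.array([0.0, 0.5, 1.0])  # pessimistic utility estimates
lam = bisect_lambda(reward_opt, utility_pess, threshold=0.6)
policy = softmax_policy(reward_opt, utility_pess, lam)
```

Returning the upper end of the bracket keeps the result on the safe side of the constraint, which mirrors the pessimistic role of the utility estimate in the summary above.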

Abstract

We study the reinforcement learning (RL) problem in a constrained Markov decision process (CMDP), where an agent explores the environment to maximize the expected cumulative reward while satisfying a single constraint on the expected total utility value in every episode. While this problem is well understood in the tabular setting, theoretical results for function approximation remain scarce. This paper closes the gap by proposing an RL algorithm for linear CMDPs that achieves $\tilde{\mathcal{O}}(\sqrt{K})$ regret with an episode-wise zero-violation guarantee. Furthermore, our method is computationally efficient, scaling polynomially with problem-dependent parameters while remaining independent of the state space size. Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.

Paper Structure

This paper contains 49 sections, 61 theorems, 147 equations, 1 figure, 1 table, 2 algorithms.

Key Result

Lemma 1

For any $\pi$ and $k$, with probability (w.p.) at least $1-\delta$,

Figures (1)

  • Figure 1: Numerical comparison of the algorithms in the synthetic tabular environment (Top), the media streaming environment (Middle), and the synthetic linear environment (Bottom). We do not run DOPE in the linear CMDP environment due to its computational intractability (see \ref{remark:DOPE not in linear}). Left: regret (\ref{eq:CMDP-goal}), Middle: violation regret (\ref{eq:vio-regret}), and Right: total number of $\pi^\mathrm{sf}$ deployments in \ref{algo:zero-vio-linear MDP}.

Theorems & Definitions (114)

  • Lemma 1: Confidence bounds
  • Definition 1: $\pi^\mathrm{sf}$ unconfident iterations
  • Theorem 1
  • Theorem 2: Logarithmic $\lvert\ref{def:unconf-set}\rvert$ bound
  • Lemma 2: Mixture policy feasibility
  • Corollary 1
  • Corollary 2: Zero-violation
  • Lemma 3: $\pi_{\alpha^{(k)}}$ optimism
  • Theorem 3
  • Theorem 4
  • ...and 104 more