Spectral-Risk Safe Reinforcement Learning with Convergence Guarantees

Dohyeong Kim, Taehyun Cho, Seungyub Han, Hojun Chung, Kyungjae Lee, Songhwai Oh

TL;DR

This work tackles risk-constrained reinforcement learning (RCRL), where the nonlinearity of spectral risk measures impedes convergence. It introduces SRCPO, a bilevel optimization framework that uses the dual form of spectral risk measures to separate inner policy optimization from outer dual-variable updates. The inner problem uses novel risk value functions and a policy-gradient-like update with convergence guarantees. The outer problem searches for an optimal dual representation with a sampler defined over the discretized spectrum, which lets the two levels be trained jointly and supports the convergence guarantee in the tabular setting. Empirically, SRCPO achieves the best performance on continuous control tasks among the RCRL methods that satisfy the risk constraints, and spectrum discretization lets it handle multiple risk measures. The result is a principled approach to safe RL in safety-critical domains where tail risks must be tightly controlled.
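
To make the role of spectrum discretization concrete, the sketch below evaluates a step-function spectrum on empirical cost samples. It relies on the standard definition of a spectral risk measure, $\mathcal{R}_\sigma(X) = \int_0^1 F_X^{-1}(u)\,\sigma(u)\,du$, with a nonnegative, nondecreasing spectrum $\sigma$ that integrates to one. CVaR at level $\alpha$ is the special case $\sigma(u) = \mathbb{1}\{u \geq \alpha\}/(1-\alpha)$. The function names, the `alphas`/`jumps` parameterization, and the sample data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def is_valid_spectrum(alphas, jumps, tol=1e-9):
    """Check that sigma(u) = sum_j jumps[j] * 1{u >= alphas[j]} is a spectrum.

    Nonnegative jumps make sigma nondecreasing; the sum below is its integral.
    """
    alphas = np.asarray(alphas, dtype=float)
    jumps = np.asarray(jumps, dtype=float)
    integral = float(np.sum(jumps * (1.0 - alphas)))
    return bool(np.all(jumps >= 0.0) and abs(integral - 1.0) < tol)


def step_spectrum_weights(n, alphas, jumps):
    """Integrate the step spectrum over each quantile bin [k/n, (k+1)/n]."""
    edges = np.linspace(0.0, 1.0, n + 1)
    lo, hi = edges[:-1], edges[1:]
    weights = np.zeros(n)
    for a, c in zip(alphas, jumps):
        # Length of the part of each bin that lies above the threshold a.
        weights += c * np.clip(hi - np.maximum(lo, a), 0.0, None)
    return weights


def spectral_risk(samples, alphas, jumps):
    """Empirical spectral risk: weighted sum of the sorted samples."""
    x = np.sort(np.asarray(samples, dtype=float))
    return float(np.dot(step_spectrum_weights(len(x), alphas, jumps), x))


def cvar(samples, alpha):
    """CVaR as the one-step spectrum with a single jump at alpha."""
    return spectral_risk(samples, [alpha], [1.0 / (1.0 - alpha)])


costs = np.arange(1, 11)  # hypothetical episode costs 1, 2, ..., 10
print(cvar(costs, 0.8))   # approx. 9.5, the mean of the two largest costs
# A two-step spectrum: half of the weight on the mean, half on CVaR_0.8.
print(spectral_risk(costs, [0.0, 0.8], [0.5, 2.5]))  # approx. 7.5
```

A step spectrum with jumps $c_j$ at levels $\alpha_j$ equals the mixture $\sum_j c_j (1-\alpha_j)\,\mathrm{CVaR}_{\alpha_j}$, which is why a single discretized parameterization can cover several risk measures.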

Abstract

The field of risk-constrained reinforcement learning (RCRL) has been developed to effectively reduce the likelihood of worst-case scenarios by explicitly handling risk-measure-based constraints. However, the nonlinearity of risk measures makes it challenging to achieve convergence and optimality. To overcome the difficulties posed by the nonlinearity, we propose a spectral risk measure-constrained RL algorithm, spectral-risk-constrained policy optimization (SRCPO), a bilevel optimization approach that utilizes the duality of spectral risk measures. In the bilevel optimization structure, the outer problem involves optimizing dual variables derived from the risk measures, while the inner problem involves finding an optimal policy given these dual variables. The proposed method, to the best of our knowledge, is the first to guarantee convergence to an optimum in the tabular setting. Furthermore, the proposed method has been evaluated on continuous control tasks and showed the best performance among other RCRL algorithms satisfying the constraints.
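
The bilevel split described in the abstract can be illustrated on CVaR, the simplest spectral risk measure, using the well-known Rockafellar-Uryasev dual form $\mathrm{CVaR}_\alpha(X) = \min_{\beta} \{\beta + \mathbb{E}[(X-\beta)_+]/(1-\alpha)\}$. For a fixed dual variable $\beta$ the objective is a plain expectation, which is the kind of quantity an inner policy-optimization step can work with, while the outer step searches over $\beta$. The sketch below only checks this identity on fixed samples. It involves no policy and is not the SRCPO algorithm, and all function names and data are hypothetical.

```python
import numpy as np


def dual_objective(samples, alpha, beta):
    """Rockafellar-Uryasev objective for a fixed dual variable beta."""
    x = np.asarray(samples, dtype=float)
    return beta + np.mean(np.maximum(x - beta, 0.0)) / (1.0 - alpha)


def cvar_via_dual(samples, alpha):
    """Outer search over beta.

    The objective is convex and piecewise linear in beta with kinks at the
    sample values, so scanning the samples is enough to find a minimizer.
    """
    x = np.asarray(samples, dtype=float)
    values = [dual_objective(x, alpha, b) for b in x]
    best = int(np.argmin(values))
    return float(values[best]), float(x[best])


def cvar_tail_mean(samples, alpha):
    """Direct CVaR: mean of the worst (1 - alpha) fraction of the samples.

    Assumes (1 - alpha) * len(samples) is a positive integer.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    k = int(round((1.0 - alpha) * len(x)))
    return float(np.mean(x[-k:]))


costs = np.arange(1, 11)  # hypothetical episode costs 1, 2, ..., 10
value, beta_star = cvar_via_dual(costs, 0.8)
print(value, beta_star, cvar_tail_mean(costs, 0.8))
```

Both routes give the same value on this data, but only the dual form exposes a quantity that is linear in the cost distribution once $\beta$ is fixed.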

Paper Structure

This paper contains 29 sections, 14 theorems, 95 equations, 8 figures, 4 tables, 2 algorithms.

Key Result

Lemma 5.0

Assuming that a function $g$ is differentiable, $V^\pi_{i,g}(\bar{s})$ and $Q^\pi_{i,g}(\bar{s},a)$ are bounded by $[g(e_i b), g(e_i b + b C_\mathrm{max}/(1-\gamma))]$, and $|Q^\pi_{i,g}(\bar{s},a) - V^\pi_{i,g}(\bar{s})| \leq b C$, where $C=\frac{C_\mathrm{max}}{1-\gamma}g'(\frac{C_\mathrm{max}}{1

Figures (8)

  • Figure 1: Discretization of spectrum.
  • Figure 2: Training curves of the legged robot locomotion tasks. The upper graph shows results for the quadrupedal robot, and the lower one is for the bipedal robot. The solid line in each graph represents the average of each metric, and the shaded area indicates the standard deviation scaled by $0.5$. The results are obtained by training each algorithm with five random seeds.
  • Figure 3: Training curves of the Safety Gymnasium tasks. The results for each task are displayed in columns, titled with the task name. The solid line represents the average of each metric, and the shaded area indicates the standard deviation scaled by $0.2$. The results are obtained by training each algorithm with five random seeds.
  • Figure 4: (Left) A correlation graph between cost rate and reward sum for policies trained in the point goal task under various risk measure constraints. The results are achieved by training policies with five random seeds for each risk measure and risk level. The center and radius of each ellipse show the average and standard deviation of the results from the five seeds, respectively. (Middle) Distribution graphs of the cost rate under different risk measure constraints. Locations of several percentiles (from the $50$th to the $99$th) are marked on the plot. The risk level of each risk measure is selected to have a similar cost rate. After training a policy in the point goal task, cost distributions have been collected by rolling out the trained policy across 500 episodes. (Right) Distribution graphs of the cost rate with different risk levels, $\alpha$, under the CVaR constraint.
  • Figure 5: Rendered images of the Safety Gymnasium and the legged robot locomotion tasks.
  • ...and 3 more figures

Theorems & Definitions (22)

  • Lemma 5.0
  • Theorem 5.1
  • Theorem 5.2
  • Lemma 6.0: Approximation Error
  • Theorem 6.1
  • Theorem 6.2
  • Lemma A.0
  • proof
  • Theorem A.1
  • proof
  • ...and 12 more