Tilted Quantile Gradient Updates for Quantile-Constrained Reinforcement Learning

Chenglin Li, Guangchun Ruan, Hua Geng

TL;DR

The paper tackles high-probability safety in reinforcement learning by replacing expectation-based safety constraints with quantile constraints, specifically enforcing $q_{1-\varepsilon}(\pi_\theta) \le d$. It introduces Tilted Quantile Policy Optimization (TQPO), which directly estimates the gradient of the quantile via sampling and employs a tilted update to the Lagrange multiplier to counteract asymmetries in the quantile distribution, integrated into a PPO-based policy optimization. The authors provide convergence proofs for the two-timescale and three-timescale updates, and demonstrate through experiments on three safe RL benchmarks that TQPO strictly satisfies the quantile safety constraint while achieving higher returns and reduced training time compared to state-of-the-art baselines. The work offers a practical approach to robust safety in RL, with implications for real-world deployment where high-confidence safety guarantees are essential.

Abstract

Safe reinforcement learning (RL) is a popular and versatile paradigm for learning reward-maximizing policies with safety guarantees. Previous works tend to express the safety constraints in expectation form because it is easy to implement, but this turns out to be ineffective at maintaining the safety constraints with high probability. To this end, we move to quantile-constrained RL, which enables a higher level of safety without any expectation-form approximations. We directly estimate the quantile gradients through sampling and provide theoretical proofs of convergence. A tilted update strategy for the quantile gradients is then implemented to compensate for the asymmetric distributional density, with a direct benefit to return performance. Experiments demonstrate that the proposed model fully meets the safety requirements (quantile constraints) while outperforming state-of-the-art benchmarks with higher return.
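
To make the gap between the two constraint forms concrete, here is a minimal sketch. It is an illustration, not the authors' code: the cost model `sample_episode_cost`, the threshold `d = 25`, and `epsilon = 0.10` are all assumed for the example. It compares an expectation constraint $\mathbb{E}[C] \le d$ with the quantile constraint $q_{1-\varepsilon} \le d$ on the same sampled episode costs.

```python
import math
import random
import statistics


def empirical_quantile(costs, level):
    """Smallest sample value c such that at least `level` of the samples are <= c."""
    ordered = sorted(costs)
    # ceil(level * n) - 1 is the zero-based index of that order statistic.
    index = max(0, math.ceil(level * len(ordered)) - 1)
    return ordered[index]


def sample_episode_cost(rng):
    # Hypothetical heavy-tailed cost: usually small, occasionally large.
    return 60.0 if rng.random() < 0.15 else 5.0


rng = random.Random(0)
costs = [sample_episode_cost(rng) for _ in range(10_000)]

d = 25.0        # cost threshold (assumed for illustration)
epsilon = 0.10  # allowed violation probability (assumed for illustration)

mean_cost = statistics.fmean(costs)
q = empirical_quantile(costs, 1.0 - epsilon)

print(f"mean cost = {mean_cost:.2f}, expectation constraint met: {mean_cost <= d}")
print(f"(1 - eps)-quantile = {q:.2f}, quantile constraint met: {q <= d}")
```

In this toy setting the average cost sits well below the threshold, yet roughly 15% of episodes exceed it, so only the quantile form detects that the 90% safety level is not reached.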

Paper Structure

This paper contains 17 sections, 5 theorems, 18 equations, 4 figures, 2 tables.

Key Result

Lemma 1

For any $\overline{\theta}\in \Theta$, the ODE $\dot{q}(t)=g_1(\overline{\theta},q)$ has the unique global asymptotically stable equilibrium $q_{\overline{\theta}}$.
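
This summary does not give the form of $g_1$. As a hedged illustration of what a stable quantile ODE can look like, the sketch below uses the standard Robbins-Monro quantile recursion $q \leftarrow q + \beta_k\,(1-\varepsilon - \mathbf{1}\{C \le q\})$, whose mean dynamics are $\dot{q} = 1-\varepsilon - F(q)$ for a cost CDF $F$. When $F$ is continuous and strictly increasing, the drift is positive below the quantile and negative above it, so the quantile is the unique attracting equilibrium. The cost model, the step-size schedule, and the function names are assumptions made for this example and are not taken from the paper.

```python
import random


def track_quantile(sample_cost, level, steps=200_000, q0=0.0, seed=0):
    """Robbins-Monro iteration q <- q + beta_k * (level - 1{C <= q})."""
    rng = random.Random(seed)
    q = q0
    for k in range(1, steps + 1):
        beta = 1.0 / k ** 0.7  # diminishing step size (illustrative choice)
        cost = sample_cost(rng)
        indicator = 1.0 if cost <= q else 0.0
        q += beta * (level - indicator)
    return q


def uniform_cost(rng):
    # Hypothetical cost model: C ~ Uniform(0, 10), whose 0.9-quantile is 9.
    return rng.uniform(0.0, 10.0)


estimate = track_quantile(uniform_cost, level=0.9)
print(f"tracked 0.9-quantile: {estimate:.3f}")
```

With the policy held fixed, the iterate drifts toward the true quantile from any starting point `q0`, which is the behavior a globally asymptotically stable equilibrium describes.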

Figures (4)

  • Figure 1: Safety Gym simulation environments
  • Figure 2: Average Cost (Row 1) and Cost Quantile (Row 2) of three algorithms on SimpleEnv (Column 1), DynamicEnv (Column 2) and GremlinEnv (Column 3). The cost quantile of PPO-Lag is calculated with $1-\varepsilon=90\%$.
  • Figure 3: Return (Row 1) and Safety Probability (Row 2) of three algorithms on SimpleEnv (Column 1), DynamicEnv (Column 2) and GremlinEnv (Column 3).
  • Figure 4: Distributions of the quantile $q_{1-\varepsilon}$ without (top) and with (bottom) the tilted term. The black vertical dashed line is the threshold $d$; $\Delta \lambda^+$ is the increase of $\lambda$ when $q_{1-\varepsilon}\ge d$, and $\Delta \lambda^-$ is the decrease of $\lambda$ when $q_{1-\varepsilon}< d$.
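
The asymmetry between $\Delta \lambda^+$ and $\Delta \lambda^-$ in the Figure 4 caption can be illustrated with a small sketch. The exact tilted term is not stated in this summary, so the direction-dependent scale factors `tilt_up` and `tilt_down` below are hypothetical stand-ins, and the quantile estimates are made-up numbers chosen to be right-skewed around $d$.

```python
def tilted_multiplier_step(lam, q, d, lr, tilt_up, tilt_down):
    """One projected dual-ascent step with direction-dependent scaling.

    tilt_up scales the increase of lam when q >= d; tilt_down scales the
    decrease when q < d. Both are hypothetical knobs for illustration.
    """
    scale = tilt_up if q >= d else tilt_down
    return max(0.0, lam + lr * scale * (q - d))


def run_updates(q_estimates, d, lr, tilt_up, tilt_down, lam0=1.0):
    lam = lam0
    for q in q_estimates:
        lam = tilted_multiplier_step(lam, q, d, lr, tilt_up, tilt_down)
    return lam


d = 25.0
# Made-up, right-skewed quantile estimates: mostly just below d, one far above.
q_estimates = [20.0, 22.0, 24.0, 24.0, 40.0]

lam_sym = run_updates(q_estimates, d, lr=0.01, tilt_up=1.0, tilt_down=1.0)
lam_tilt = run_updates(q_estimates, d, lr=0.01, tilt_up=1.0, tilt_down=1.5)

print(f"symmetric update: lambda = {lam_sym:.3f}")
print(f"tilted update:    lambda = {lam_tilt:.3f}")
```

In this toy case the symmetric update lets the single large overshoot dominate, so $\lambda$ ends above its starting value even though most estimates satisfy the constraint, while the tilted scaling balances the increases against the decreases.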

Theorems & Definitions (5)

  • Lemma 1
  • Theorem 1
  • Lemma 2
  • Lemma 3
  • Theorem 2