Tilted Quantile Gradient Updates for Quantile-Constrained Reinforcement Learning
Chenglin Li, Guangchun Ruan, Hua Geng
TL;DR
The paper tackles high-probability safety in reinforcement learning by replacing expectation-based safety constraints with quantile constraints, specifically enforcing $q_{1-\varepsilon}(\pi_\theta) \le d$. It introduces Tilted Quantile Policy Optimization (TQPO), which directly estimates the gradient of the quantile via sampling and employs a tilted update to the Lagrange multiplier to counteract asymmetries in the quantile distribution, integrated into a PPO-based policy optimization. The authors provide convergence proofs for the two-timescale and three-timescale updates, and demonstrate through experiments on three safe RL benchmarks that TQPO strictly satisfies the quantile safety constraint while achieving higher returns and reduced training time compared to state-of-the-art baselines. The work offers a practical approach to robust safety in RL, with implications for real-world deployment where high-confidence safety guarantees are essential.
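To make the difference between an expectation constraint and the quantile constraint $q_{1-\varepsilon}(\pi_\theta) \le d$ concrete, here is a minimal sketch in Python. It is not taken from the paper: the function names, the episode costs, and the values of `epsilon` and `d` are all hypothetical, chosen only to show that a policy can satisfy the constraint on average while violating it at the $(1-\varepsilon)$-quantile.

```python
import numpy as np

def empirical_quantile(costs, level):
    """Smallest sampled cost c whose empirical CDF value is at least `level`."""
    sorted_costs = np.sort(np.asarray(costs, dtype=float))
    n = len(sorted_costs)
    # Order-statistic index ceil(level * n) - 1; the small tolerance guards
    # against floating-point round-up (e.g. 0.9 * 100 = 90.00000000000001).
    k = int(np.ceil(level * n - 1e-9)) - 1
    return float(sorted_costs[min(max(k, 0), n - 1)])

def check_constraints(costs, epsilon, d):
    """Compare the expectation constraint with the (1 - epsilon)-quantile constraint."""
    mean_cost = float(np.mean(costs))
    quantile_cost = empirical_quantile(costs, 1.0 - epsilon)
    return {
        "mean": mean_cost,
        "quantile": quantile_cost,
        "mean_ok": mean_cost <= d,          # expectation-form constraint
        "quantile_ok": quantile_cost <= d,  # quantile-form constraint
    }

# Hypothetical cumulative costs from 100 episodes: 80 cheap, 20 expensive.
costs = [10.0] * 80 + [60.0] * 20
result = check_constraints(costs, epsilon=0.1, d=25.0)
print(result)
```

With these made-up numbers the mean cost is below the threshold, yet one episode in five exceeds it, so the quantile check fails. That gap is the situation a quantile constraint is meant to rule out.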
Abstract
Safe reinforcement learning (RL) is a popular and versatile paradigm for learning reward-maximizing policies with safety guarantees. Previous works tend to express safety constraints in expectation form for ease of implementation, but this turns out to be ineffective at maintaining safety with high probability. We therefore turn to quantile-constrained RL, which enables a higher level of safety without any expectation-form approximation. We directly estimate the quantile gradients through sampling and provide theoretical proofs of convergence. A tilted update strategy for the quantile gradients is then implemented to compensate for the asymmetric distributional density, with a direct benefit to return performance. Experiments demonstrate that the proposed model fully meets the safety requirements (quantile constraints) while outperforming state-of-the-art benchmarks with higher return.
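The abstract does not spell out the tilted update rule, so the following Python sketch rests on an assumption: the tilt is modeled here as two different step sizes for the Lagrange multiplier, a larger one when the estimated quantile exceeds the threshold $d$ and a smaller one when it falls below. The function name `tilted_multiplier_step` and the parameters `lr_up` and `lr_down` are hypothetical, and the paper's actual rule may differ.

```python
def tilted_multiplier_step(lam, quantile_estimate, d, lr_up, lr_down):
    """One projected update of a Lagrange multiplier with asymmetric step sizes.

    Assumption (not from the paper): the tilt is represented by using `lr_up`
    when the constraint is violated and `lr_down` when it is satisfied.
    """
    violation = quantile_estimate - d
    step = lr_up if violation > 0 else lr_down
    # Project onto [0, inf) so the multiplier stays non-negative.
    return max(0.0, lam + step * violation)

# Hypothetical sequence of quantile estimates across training iterations.
lam = 0.0
for q_hat in [30.0, 28.0, 24.0, 26.0]:
    lam = tilted_multiplier_step(lam, q_hat, d=25.0, lr_up=0.1, lr_down=0.02)
print(lam)
```

In this sketch the multiplier rises quickly while the quantile estimate is above the threshold and decays slowly once it drops below, which is one simple way to treat overshoot and undershoot asymmetrically.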
