Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning
Jialiang Fan, Shixiong Jiang, Mengyu Liu, Fanxin Kong
TL;DR
This paper exposes the vulnerability of Safe RL policies to adversarial perturbations by introducing an adversarial attack framework based on inverse constrained reinforcement learning (ICRL). The approach learns a surrogate safety constraint $\psi$ and a learner policy $\pi_L$ from expert demonstrations and black-box environment interaction, enabling gradient-based perturbations without access to the victim's gradients or the ground-truth constraints. The authors provide theoretical results establishing the feasibility of such attacks and deriving perturbation bounds, and empirically show that the attack induces safety violations across multiple Safe RL benchmarks under constrained budgets. The work highlights practical risks in Safe RL and offers a foundation for developing more robust policies through defense strategies such as adversarial training and explicit state-constraint encoding.
Abstract
Safe reinforcement learning (Safe RL) aims to ensure policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, making them vulnerable to adversarial perturbations commonly encountered in real-world settings. In addition, existing gradient-based adversarial attacks typically require access to the policy's gradient information, which is often impractical in real-world scenarios. To address these challenges, we propose an adversarial attack framework to reveal vulnerabilities of Safe RL policies. Using expert demonstrations and black-box environment interaction, our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization without requiring the victim policy's internal gradients or the ground-truth safety constraints. We further provide theoretical analysis establishing feasibility and deriving perturbation bounds. Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of our approach under limited privileged access.
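To make the idea of gradient-based attack optimization on a surrogate concrete, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes a hypothetical linear learner policy and a hypothetical logistic constraint model, both with random weights standing in for learned ones, and it swaps in a single signed-gradient (FGSM-style) step on the observation under an assumed $L_\infty$ budget `eps`. All function and variable names are invented for illustration.

```python
import numpy as np

def surrogate_policy(obs, W):
    """Hypothetical linear learner policy: maps an observation to an action."""
    return W @ obs

def surrogate_cost(state, action, w_s, w_a):
    """Hypothetical constraint model: estimated violation score in (0, 1)."""
    z = w_s @ state + w_a @ action
    return 1.0 / (1.0 + np.exp(-z))

def signed_gradient_attack(state, W, w_s, w_a, eps):
    """One signed-gradient step on the observation within an L-infinity ball of radius eps.

    Only the observation fed to the policy is perturbed; the true state is unchanged.
    """
    action = surrogate_policy(state, W)
    psi = surrogate_cost(state, action, w_s, w_a)
    # Chain rule through the surrogate: d(cost)/d(delta) = psi * (1 - psi) * W^T w_a
    grad = psi * (1.0 - psi) * (W.T @ w_a)
    return eps * np.sign(grad)

# Random weights stand in for models that would be learned from demonstrations.
rng = np.random.default_rng(0)
state = rng.normal(size=4)
W = rng.normal(size=(2, 4))
w_s = rng.normal(size=4)
w_a = rng.normal(size=2)
eps = 0.05  # assumed perturbation budget

delta = signed_gradient_attack(state, W, w_s, w_a, eps)
cost_clean = surrogate_cost(state, surrogate_policy(state, W), w_s, w_a)
cost_attacked = surrogate_cost(state, surrogate_policy(state + delta, W), w_s, w_a)
```

Because the gradient is taken through the surrogate models only, the sketch never queries the victim policy's internals. In this toy setting the surrogate cost is monotone in a linear function of the perturbation, so the signed step is the maximizer over the budget ball; with nonlinear learned models a single step would only be a first-order approximation.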
