Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning

Jialiang Fan, Shixiong Jiang, Mengyu Liu, Fanxin Kong

TL;DR

This paper addresses the vulnerability of Safe RL policies to adversarial perturbations in real-world settings by introducing an adversarial attack framework based on inverse constrained reinforcement learning (ICRL). The approach learns a surrogate safety constraint $\psi$ and a learner policy $\pi_L$ from expert demonstrations via black-box interaction, enabling gradient-based perturbations without access to the victim's gradients or ground-truth constraints. The authors provide theoretical results showing the feasibility and bounds of such attacks, and empirically demonstrate effective safety-violation induction across multiple Safe RL benchmarks under constrained budgets. The work highlights practical risks in Safe RL and offers a foundation for developing more robust policies through defense strategies like adversarial training and explicit state-constraint encoding.

Abstract

Safe reinforcement learning (Safe RL) aims to ensure policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, making them vulnerable to adversarial perturbations commonly encountered in real-world settings. In addition, existing gradient-based adversarial attacks typically require access to the policy's gradient information, which is often impractical in real-world scenarios. To address these challenges, we propose an adversarial attack framework to reveal vulnerabilities of Safe RL policies. Using expert demonstrations and black-box environment interaction, our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization without requiring the victim policy's internal gradients or the ground-truth safety constraints. We further provide theoretical analysis establishing feasibility and deriving perturbation bounds. Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of our approach under limited privileged access.
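
The attack optimization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the linear surrogate constraint `surrogate_cost`, the tanh learner policy `learner_policy`, the $L_\infty$ budget `eps`, and the sign-gradient update are all assumptions chosen for illustration. The sketch only shows the general idea of ascending the gradient of a learned constraint $\psi(s+\delta, \pi_L(s+\delta))$ with respect to a state perturbation $\delta$, using surrogate models instead of the victim policy's gradients.

```python
import numpy as np

def learner_policy(s, W):
    """Hypothetical surrogate policy pi_L: linear map followed by tanh."""
    return np.tanh(W @ s)

def surrogate_cost(s, a, w_s, w_a):
    """Hypothetical learned constraint psi(s, a): linear score in state and action."""
    return float(w_s @ s + w_a @ a)

def cost_gradient(s, W, w_s, w_a):
    """Analytic gradient of psi(s, pi_L(s)) with respect to the state s."""
    a = np.tanh(W @ s)
    # d/ds [w_s.s + w_a.tanh(W s)] = w_s + W^T (w_a * (1 - tanh(W s)^2))
    return w_s + W.T @ (w_a * (1.0 - a ** 2))

def pgd_attack(s, W, w_s, w_a, eps=0.1, step=0.02, iters=20):
    """Projected sign-gradient ascent on the surrogate cost (assumed L-inf budget eps)."""
    delta = np.zeros_like(s)
    for _ in range(iters):
        g = cost_gradient(s + delta, W, w_s, w_a)
        # Ascend the surrogate cost, then project back into the L-inf ball.
        delta = np.clip(delta + step * np.sign(g), -eps, eps)
    return delta

if __name__ == "__main__":
    # Toy parameters, made up purely for this illustration.
    W = np.array([[0.5, 0.2], [0.1, 0.3]])
    w_s = np.array([1.0, 0.5])
    w_a = np.array([0.3, 0.7])
    s = np.array([0.2, -0.1])

    delta = pgd_attack(s, W, w_s, w_a)
    clean = surrogate_cost(s, learner_policy(s, W), w_s, w_a)
    adv = surrogate_cost(s + delta, learner_policy(s + delta, W), w_s, w_a)
    print("perturbation:", delta)
    print("surrogate cost before/after:", clean, adv)
```

In this sketch the perturbed observation `s + delta` would then be fed to the black-box victim policy. Whether an increase in the surrogate cost transfers to a real safety violation depends on how closely the learned constraint tracks the ground-truth one, which is the question the paper's theoretical analysis addresses.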

Paper Structure

This paper contains 27 sections, 4 theorems, 31 equations, 2 figures, 4 tables, and 2 algorithms.

Key Result

Theorem 1

Let $\psi(s,a)$ be a constraint function learned via ICRL that satisfies $\forall (s,a), ~ |\psi(s,a) - \psi^*(s,a)| \leq \xi,$ where $\psi^*$ denotes the ground-truth constraint. Let $\pi_E$ be the expert policy, and let $\delta$ be a perturbation such that $\psi(s + \delta, \pi_E(s + \delta)) > \e

Figures (2)

  • Figure 1: An overview of the proposed adversarial attack framework.
  • Figure 2: Adversarial attack results across four environments. Top row: reward curves. Bottom row: cost curves. From left to right: AntPosition, AntVelocity, BallRun, BallCircle. Our method (blue line) achieves consistently strong safety violations under L1 assumptions compared to all baselines that require gradient access (L2).

Theorems & Definitions (8)

  • Theorem 1: Feasibility of Constraint-Based Perturbations
  • Lemma 1: Local Optimality of Gradient-Based Attacks
  • Lemma 2: One-Step Perturbation Cost Value Bound
  • Lemma 3: Episodic Perturbation Cost Value Bound
  • Proofs (4)