Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning

Jialiang Fan; Shixiong Jiang; Mengyu Liu; Fanxin Kong

TL;DR

This paper addresses the vulnerability of Safe RL policies to adversarial perturbations in real-world settings by introducing an adversarial attack framework based on inverse constrained reinforcement learning (ICRL). The approach learns a surrogate safety constraint $\psi$ and a learner policy $\pi_L$ from expert demonstrations via black-box interaction, enabling gradient-based perturbations without access to the victim's gradients or ground-truth constraints. The authors provide theoretical results showing the feasibility and bounds of such attacks, and empirically demonstrate effective safety-violation induction across multiple Safe RL benchmarks under constrained budgets. The work highlights practical risks in Safe RL and offers a foundation for developing more robust policies through defense strategies like adversarial training and explicit state-constraint encoding.

Abstract

Safe reinforcement learning (Safe RL) aims to ensure policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, making them vulnerable to adversarial perturbations commonly encountered in real-world settings. In addition, existing gradient-based adversarial attacks typically require access to the policy's gradient information, which is often impractical in real-world scenarios. To address these challenges, we propose an adversarial attack framework to reveal vulnerabilities of Safe RL policies. Using expert demonstrations and black-box environment interaction, our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization without requiring the victim policy's internal gradients or the ground-truth safety constraints. We further provide theoretical analysis establishing feasibility and deriving perturbation bounds. Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of our approach under limited privileged access.
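
The core idea in the abstract, optimizing a state perturbation against a learned constraint model and a surrogate policy instead of the victim's own gradients, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `learner_policy` and `surrogate_cost` are hypothetical toy stand-ins for the learner policy $\pi_L$ and learned constraint $\psi$, the gradient is taken by finite differences, and the update is a generic projected sign-gradient ascent under an $\ell_\infty$ budget `eps`.

```python
import numpy as np

def learner_policy(s, W):
    """Hypothetical surrogate policy pi_L: linear map with tanh squashing."""
    return np.tanh(W @ s)

def surrogate_cost(s, a, w_s, w_a):
    """Hypothetical learned constraint psi(s, a); larger means less safe."""
    return float(w_s @ s + w_a @ a)

def numerical_grad(f, x, h=1e-5):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def pgd_perturbation(s, objective, eps=0.1, step=0.02, iters=20):
    """Projected sign-gradient ascent on objective(s + delta), |delta|_inf <= eps."""
    delta = np.zeros_like(s)
    for _ in range(iters):
        g = numerical_grad(lambda d: objective(s + d), delta)
        # Ascend the surrogate cost, then project back into the budget.
        delta = np.clip(delta + step * np.sign(g), -eps, eps)
    return delta

# Toy parameters (assumed values, chosen only to make the example concrete).
W = np.array([[0.5, -0.2], [0.1, 0.3]])
w_s = np.array([1.0, -0.5])
w_a = np.array([0.3, 0.2])
s = np.array([0.2, -0.1])

def objective(x):
    # psi evaluated at the perturbed state and the surrogate policy's action.
    return surrogate_cost(x, learner_policy(x, W), w_s, w_a)

delta = pgd_perturbation(s, objective, eps=0.1)
print("perturbation:", delta)
print("surrogate cost before/after:", objective(s), objective(s + delta))
```

Only the surrogate models are differentiated here; in a black-box setting the perturbed state `s + delta` would then be fed to the victim policy, which is never queried for gradients.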

Paper Structure (27 sections, 4 theorems, 31 equations, 2 figures, 4 tables, 2 algorithms)

Introduction
Related Work
Problem Formulation
Safe Reinforcement Learning
Threat Model
Attack Metrics
Learning Constraints via ICRL
Method
Attack Generation
End-to-end attack pipeline
Theoretical Analysis of Adversarial Attack
Remark 1.
Experiments
Experimental Setup
ICRL Results
...and 12 more sections

Key Result

Theorem 1

Let $\psi(s,a)$ be a constraint function learned via ICRL that satisfies $\forall (s,a), ~ |\psi(s,a) - \psi^*(s,a)| \leq \xi,$ where $\psi^*$ denotes the ground-truth constraint. Let $\pi_E$ be the expert policy, and let $\delta$ be a perturbation such that $\psi(s + \delta, \pi_E(s + \delta)) > \e

Figures (2)

Figure 1: An overview of the proposed adversarial attack framework.
Figure 2: Adversarial attack results across four environments. Top row: reward curves. Bottom row: cost curves. From left to right: AntPosition, AntVelocity, BallRun, BallCircle. Our method (blue line) achieves consistently strong safety violations under L1 assumptions compared to all baselines that require gradient access (L2).

Theorems & Definitions (8)

Theorem 1: Feasibility of Constraint-Based Perturbations
Lemma 1: Local Optimality of Gradient-Based Attacks
Lemma 2: One-Step Perturbation Cost Value Bound
Lemma 3: Episodic Perturbation Cost Value Bound
proof
proof
proof
proof
