Towards Optimal Adversarial Robust Reinforcement Learning with Infinity Measurement Error

Haoran Li, Zicheng Zhang, Wang Luo, Congying Han, Jiayu Lv, Tiande Guo, Yudong Hu

TL;DR

This work establishes the Intrinsic State-adversarial MDP (ISA-MDP) as a universal framework for state-adversarial decision making and proves that, within ISA-MDP, a deterministic, stationary Optimal Robust Policy (ORP) exists and coincides with the Bellman optimal policy. It demonstrates that achieving the ORP requires controlling the infinity-measurement error in both action-value and probability spaces, and shows that traditional 1-measurement-error approaches can leave agents vulnerable. Building on this theory, the Consistent Adversarial Robust Reinforcement Learning (CAR-RL) framework optimizes surrogates of infinity-measurement errors. It is instantiated as CAR-DQN for value-based methods and CAR-PPO for policy-based methods, and validated with extensive Atari and MuJoCo experiments. The results indicate improved natural and robust performance, plus stability and consistency between natural returns and adversarial robustness, providing a principled path to deploying robust DRL agents in real-world settings where adversarial perturbations are possible but not ubiquitous.

Abstract

Ensuring the robustness of deep reinforcement learning (DRL) agents against adversarial attacks is critical for their trustworthy deployment. Recent research highlights the challenges of achieving state-adversarial robustness and suggests that an optimal robust policy (ORP) does not always exist, complicating the enforcement of strict robustness constraints. In this paper, we further explore the concept of ORP. We first introduce the Intrinsic State-adversarial Markov Decision Process (ISA-MDP), a novel formulation where adversaries cannot fundamentally alter the intrinsic nature of state observations. ISA-MDP, supported by empirical and theoretical evidence, universally characterizes decision-making under state-adversarial paradigms. We rigorously prove that within ISA-MDP, a deterministic and stationary ORP exists, aligning with the Bellman optimal policy. Our findings theoretically reveal that improving DRL robustness does not necessarily compromise performance in natural environments. Furthermore, we demonstrate the necessity of infinity measurement error (IME) in both $Q$-function and probability spaces to achieve ORP, unveiling vulnerabilities of previous DRL algorithms that rely on $1$-measurement errors. Motivated by these insights, we develop the Consistent Adversarial Robust Reinforcement Learning (CAR-RL) framework, which optimizes surrogates of IME. We apply CAR-RL to both value-based and policy-based DRL algorithms, achieving superior performance and validating our theoretical analysis.
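To make the distinction between 1-measurement and infinity-measurement errors concrete, the following minimal sketch compares them on a hypothetical vector of per-state errors. The array `err`, the function names, and the temperature `beta` are illustrative assumptions. The log-sum-exp smoothing is a generic differentiable stand-in for a maximum, not the surrogate used by CAR-RL.

```python
import numpy as np

def one_error(err):
    """Average absolute error (a discrete 1-measurement error)."""
    return float(np.mean(np.abs(err)))

def inf_error(err):
    """Largest absolute error (a discrete infinity-measurement error)."""
    return float(np.max(np.abs(err)))

def smooth_max_error(err, beta=50.0):
    """Log-sum-exp smoothing of the maximum (generic choice, an assumption).

    It satisfies max <= value <= max + log(n) / beta.
    """
    x = beta * np.abs(err)
    m = np.max(x)  # subtract the maximum for numerical stability
    return float((m + np.log(np.sum(np.exp(x - m)))) / beta)

# Hypothetical errors: perfect everywhere except at one state.
err = np.zeros(1000)
err[0] = 1.0

avg = one_error(err)            # small: the single bad state is averaged away
worst = inf_error(err)          # large: the single bad state dominates
smooth = smooth_max_error(err)  # close to the worst-case value
```

The average error is three orders of magnitude smaller than the worst-case error here, which is the kind of gap that lets a function look accurate on average while remaining badly wrong at individual states.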

Paper Structure

This paper contains 75 sections, 47 theorems, 261 equations, 17 figures, 9 tables, 2 algorithms.

Key Result

Theorem 2

For any MDP $\mathcal{M}$, let $\mathcal{S}_{nu}$ denote the state set where the optimal action is not unique, i.e., $\mathcal{S}_{nu} = \left\{ s\in\mathcal{S} | \mathop{\arg\max}_a Q^*(s,a) \text{ is not a singleton} \right\}$. Given $\epsilon > 0$, let $\mathcal{S}_{nin}$ denote the set of states where $\mathcal{S}_{0}$ is the set of discontinuous points that cause the optimal action to change,

Figures (17)

  • Figure 1: An example of state adversary in DQN. While the adversary disrupts the policy executed by DQN, it does not affect the optimal action prescribed by the Bellman optimal policy. This observation leads us to examine two critical issues: whether the Bellman optimal policy serves as the ORP, and why vanilla DQN trained with Bellman error fails to achieve robustness.
  • Figure 2: Episode rewards of CAR-DQN agents with and without 10-step PGD attacks on 4 Atari games across 5 random seeds. As evidenced by the overlap of the two curves, CAR-DQN achieves consistency between the Bellman optimal policy and the ORP.
  • Figure 3: Examples of adversarial robustness for $Q$ satisfying $\|Q-Q^*\|_p\le\delta$. Given a perturbation radius $\epsilon$, the red line represents the set $\mathcal{S}_{adv}^Q$ of states for which an adversarial state exists. The left panel depicts the case $p=\infty$. In this scenario, all such $Q$ functions lie within the shaded area, and the measure of $\mathcal{S}_{adv}^Q$ is small, approximately $2 \epsilon + O\left( \delta \right)$, indicating good robustness. In contrast, the right panel shows that for $1\le p<\infty$, there always exists a $Q$ function such that $\mathcal{S}_{adv}^Q = \mathcal{S}$, indicating poor robustness.
  • Figure 4: Examples of adversarial robustness for the policy $\pi$ satisfying $\mathcal{D}_{k, \operatorname{KL}}^{\mu} \left( \varphi \| \pi \right) \le \delta$. Given a perturbation radius $\epsilon$, the red line represents the set $\mathcal{S}_{adv}^{\pi,\epsilon}$ of states for which an adversarial state exists. The left panel depicts the case $k = \infty$. In this scenario, all such policies $\pi$ lie within the shaded area, and the measure of $\mathcal{S}_{adv}^{\pi,\epsilon}$ is small, approximately $2\epsilon + O\left(h(\delta)\right)$, indicating good robustness. In contrast, the right panel illustrates that for $1 \leq k < \infty$, there always exists a $\pi$ such that $\mathcal{S}_{adv}^{\pi,\epsilon} = \mathcal{S}$ in the worst case, indicating poor robustness.
  • Figure 5: Natural reward and worst-case robustness under various attacks in MuJoCo.
  • ...and 12 more figures
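
The contrast described in the Figure 3 caption can be reproduced numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's construction: the state space is the interval $[0,1]$ on a grid, there are two actions with $Q^*(s,0)=1-s$ and $Q^*(s,1)=s$, and a state $s$ is counted as belonging to $\mathcal{S}_{adv}^Q$ when some $s'$ with $|s'-s|\le\epsilon$ has a greedy action under $Q$ that differs from the optimal action at $s$. All names are hypothetical.

```python
import numpy as np

def adversarial_fraction(q, q_star, k):
    """Fraction of grid states s that have some s' within k grid cells
    whose greedy action under q differs from the optimal action at s."""
    a_q = np.argmax(q, axis=1)
    a_star = np.argmax(q_star, axis=1)
    n = len(a_q)
    flagged = np.zeros(n, dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - k), min(n, i + k + 1)
        flagged[i] = np.any(a_q[lo:hi] != a_star[i])
    return float(flagged.mean())

n, eps, delta = 10_000, 0.05, 0.01
s = np.linspace(0.0, 1.0, n)
k = int(round(eps * (n - 1)))            # perturbation radius in grid cells
q_star = np.stack([1.0 - s, s], axis=1)  # toy Q*, optimal action flips at s = 0.5
a_star = np.argmax(q_star, axis=1)
rows = np.arange(n)

# Case p = infinity: worst perturbation with sup-norm error delta.
# It can only flip the greedy action where the action gap is below 2 * delta.
q_inf = q_star.copy()
q_inf[rows, a_star] -= delta
q_inf[rows, 1 - a_star] += delta

# Case p = 1: swap the two action values on a sparse set of "spike" states.
# The spikes are closer together than eps, yet they cover a tiny measure.
q_l1 = q_star.copy()
spikes = np.arange(0, n, 400)
q_l1[spikes] = q_star[spikes, ::-1]

sup_err_inf = float(np.max(np.abs(q_inf - q_star)))
l1_err = float(np.mean(np.sum(np.abs(q_l1 - q_star), axis=1)))
sup_err_l1 = float(np.max(np.abs(q_l1 - q_star)))

frac_inf = adversarial_fraction(q_inf, q_star, k)  # about 2 * eps + 2 * delta
frac_l1 = adversarial_fraction(q_l1, q_star, k)    # every state is adversarial
```

In this toy setting the sup-norm-bounded $Q$ keeps the adversarial set confined to a band around the decision boundary, while the spiky $Q$ has an average error below $\delta$ and still makes every state vulnerable.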

Theorems & Definitions (79)

  • Definition 1: Intrinsic State Neighborhood
  • Theorem 2: Sparse Difference Between Intrinsic and Standard Neighborhood
  • Definition 3: Intrinsic Adversary
  • Definition 4: Intrinsic State-adversarial Markov Decision Process (ISA-MDP)
  • Definition 5: Consistent Adversarial Robust (CAR) Operator $\mathcal{T}_{car}$
  • Theorem 6: Relation between $Q^*$ and $Q^{\pi^*\circ \nu^*(\pi^*)}$
  • Remark 7
  • Corollary 8: Existence of ORP
  • Theorem 9: Convergence of CAR Operator $\mathcal{T}_{car}$
  • Theorem 10: Necessity of the $L^\infty$ Space for Adversarial Robustness
  • ...and 69 more