Towards Optimal Adversarial Robust Q-learning with Bellman Infinity-error

Haoran Li; Zicheng Zhang; Wang Luo; Congying Han; Yudong Hu; Tiande Guo; Shichen Liao

TL;DR

This work proves the existence of a deterministic and stationary optimal robust policy (ORP) that aligns with the Bellman optimal policy, and explains why prior DRL algorithms that target the Bellman optimal policy under the $L^{1}$-norm are vulnerable.

Abstract

Establishing robust policies is essential to counter attacks or disturbances affecting deep reinforcement learning (DRL) agents. Recent studies explore state-adversarial robustness and suggest that an optimal robust policy (ORP) may not exist, which makes it difficult to set strict robustness constraints. This work further investigates ORP. First, we introduce a consistency assumption of policy (CAP), which states that optimal actions in the Markov decision process remain consistent under minor perturbations; the assumption is supported by empirical and theoretical evidence. Building upon CAP, we prove the existence of a deterministic and stationary ORP that aligns with the Bellman optimal policy. Furthermore, we show that the $L^{\infty}$-norm is necessary when minimizing the Bellman error to attain ORP. This finding clarifies the vulnerability of prior DRL algorithms that target the Bellman optimal policy with the $L^{1}$-norm and motivates us to train a Consistent Adversarial Robust Deep Q-Network (CAR-DQN) by minimizing a surrogate of the Bellman Infinity-error. The top-tier performance of CAR-DQN across various benchmarks validates its practical effectiveness and reinforces the soundness of our theoretical analysis.
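
The abstract refers to "a surrogate of Bellman Infinity-error" without stating its form on this page. As a rough illustration only, one common way to smooth a maximum over per-sample errors is a softmax-weighted average. The function name `soft_max_surrogate`, the temperature `lam`, and the sample errors are hypothetical choices for this sketch; this is not necessarily the objective used by CAR-DQN.

```python
import numpy as np

def soft_max_surrogate(errors, lam):
    """Softmax-weighted average of non-negative per-sample errors.

    Assumption for illustration: `lam` is a temperature. A small `lam`
    puts almost all weight on the largest error (close to the max, i.e.
    an infinity-norm style objective); a large `lam` spreads the weight
    evenly (close to the mean, i.e. an L1 style objective).
    """
    errors = np.asarray(errors, dtype=float)
    logits = errors / lam
    logits = logits - logits.max()      # subtract the max for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()   # softmax weights over the batch
    return float(weights @ errors)

# Hypothetical absolute Bellman errors for a batch of three transitions.
td_errors = np.array([0.1, 0.2, 3.0])

near_max = soft_max_surrogate(td_errors, lam=0.01)   # dominated by the worst sample
near_mean = soft_max_surrogate(td_errors, lam=1e6)   # close to the plain average
```

With these inputs, `near_max` is essentially the largest error and `near_mean` is essentially the batch mean, which shows how a single coefficient can interpolate between an average-case and a worst-case objective.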

Paper Structure (38 sections, 41 theorems, 200 equations, 13 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Optimal Adversarial Robustness
Consistency Assumption of Policy
Consistent Optimal Robust Policy
Policy Robustness under Bellman $p$-error
Necessity of $L^\infty$-norm for Adversarial Robustness
Stability of Bellman Optimality Equations
Consistent Adversarial Robust DQN
Stability of Deep Q-learning
Consistent Adversarial Robust Objective
Experiments
Implementation details
Comparison Results
...and 23 more sections

Key Result

Theorem 4.2

Given $\epsilon>0$, let $\mathcal{S}_{nu} = \left\{ s\in\mathcal{S} \mid \mathop{\arg\max}_a Q^*(s,a) \text{ is not a singleton} \right\}$ and $\mathcal{S}_{nin} = \left\{ s\in\mathcal{S} \mid B_\epsilon(s) \neq B^*_\epsilon(s) \right\}$. If $Q^*(\cdot,a)$ is continuous almost everywhere in $\mathcal{S}$ ...

Figures (13)

Figure 1: An example of a state adversary in DQN. While the adversary disrupts the policy performed by DQN, it does not change the optimal action dictated by the Bellman optimal policy. This observation prompts the study of two key issues: whether the Bellman optimal policy serves as the ORP, and why vanilla DQN trained with Bellman error fails to ensure robustness.
Figure 2: Examples of adversarial robustness for $Q$ satisfying $\|Q-Q^*\|_p\le\delta$. Given a perturbation radius $\epsilon$, the red line represents the set $\mathcal{S}_{adv}^Q$ of states that have adversarial states. The left panel depicts the case $p=\infty$, where every such $Q$ lies in the shaded area and the measure of $\mathcal{S}_{adv}^Q$ is small, namely $2 \epsilon + O\left( \delta \right)$. The right panel shows that for $1\le p<\infty$, there always exists a $Q$ such that $\mathcal{S}_{adv}^Q = \mathcal{S}$, indicating poor robustness.
Figure 3: Episode rewards of CAR-DQN agents with and without 10-step PGD attacks on 4 Atari games, over 5 random seeds. As evidenced by the overlap of the two curves, CAR-DQN achieves consistency between the Bellman optimal policy and the ORP.
Figure 4: Episode rewards of baselines and CAR-DQN with and without PGD attacks on 4 Atari games. Shaded regions are computed over 5 random seeds. CAR-DQN demonstrates superior natural and robust performance in all environments.
Figure 5: Natural, PGD attack, and MinBest attack rewards of CAR-DQN with different soft coefficients on the RoadRunner game.
...and 8 more figures
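
The contrast described in the Figure 2 caption can be reproduced on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's construction: the state space is a 1-D grid on $[0,1]$, there are two actions, $Q^*$ prefers action 0 everywhere with a gap of 1, and a state counts as having an adversarial state when some grid point within the perturbation radius has a greedy action under $Q$ that differs from the optimal action at the original state. All names (`adversarial_fraction`, `q_spiky`, `q_uniform`) are hypothetical.

```python
import numpy as np

def l1_error(q, q_star):
    """Average over states of the summed absolute deviation (a discrete L1 norm)."""
    return float(np.abs(q - q_star).sum(axis=1).mean())

def linf_error(q, q_star):
    """Largest absolute deviation over all state-action pairs."""
    return float(np.abs(q - q_star).max())

def adversarial_fraction(q, q_star, radius_steps):
    """Fraction of states with a neighbour (within radius_steps grid points)
    whose greedy action under q differs from the optimal action at the state."""
    greedy_q = q.argmax(axis=1)
    greedy_star = q_star.argmax(axis=1)
    n = len(greedy_q)
    flagged = np.zeros(n, dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - radius_steps), min(n, i + radius_steps + 1)
        flagged[i] = np.any(greedy_q[lo:hi] != greedy_star[i])
    return float(flagged.mean())

n_states = 1001      # grid over [0, 1] with spacing 0.001
radius_steps = 50    # perturbation radius eps = 0.05 in grid steps
q_star = np.column_stack([np.ones(n_states), np.zeros(n_states)])

# Small L1 error, large L-infinity error: narrow spikes every eps that
# flip the greedy action at isolated states.
q_spiky = q_star.copy()
q_spiky[::radius_steps, 1] = 2.0

# Bounded L-infinity error below half the action gap: noise everywhere,
# but the greedy action can never flip.
rng = np.random.default_rng(0)
q_uniform = q_star + rng.uniform(-0.4, 0.4, size=q_star.shape)

spiky_fraction = adversarial_fraction(q_spiky, q_star, radius_steps)
uniform_fraction = adversarial_fraction(q_uniform, q_star, radius_steps)
```

In this toy setting `q_spiky` has the smaller $L^1$ error, yet every state has an adversarial neighbour (`spiky_fraction` is 1.0), while `q_uniform` has a larger $L^1$ error but an $L^\infty$ error below half the action gap, so no state does (`uniform_fraction` is 0.0). Because the toy $Q^*$ has no decision boundary, the $2\epsilon$ term from the caption does not appear here.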

Theorems & Definitions (96)

Definition 4.1: Intrinsic State Neighborhood
Theorem 4.2: Rationality of CAP
Definition 4.4: CAR Operator $\mathcal{T}_{car}$
Theorem 4.5: Relation between $Q^*$ and $Q^{\pi^*\circ \nu^*(\pi^*)}$
Remark 4.6
Corollary 4.7: Existence of ORP
Theorem 5.1: Necessity of $L^\infty$-norm
Definition 5.2: Stability of Functional Equations
Theorem 5.3: Stable Properties of $\mathcal{T}_{B}$ in $L^p$ Spaces
Definition 6.1: $(p,d_{\mu_0}^\pi)$-seminorm
...and 86 more
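
Several entries above refer to the Bellman operator $\mathcal{T}_{B}$ and to $L^p$ spaces, which this page does not define. For orientation, the standard textbook definitions are sketched below; the reward $r$, discount factor $\gamma$, transition kernel $P$, and probability measure $\mu$ over state-action pairs are notational assumptions and may differ from the paper's.

```latex
% Bellman optimality operator (standard form; notation assumed)
(\mathcal{T}_{B} Q)(s,a) = r(s,a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s,a)} \Big[ \max_{a'} Q(s',a') \Big]

% Bellman error measured in an L^p norm, 1 \le p < \infty
\| \mathcal{T}_{B} Q - Q \|_{p}
  = \Big( \int \big| (\mathcal{T}_{B} Q)(s,a) - Q(s,a) \big|^{p} \, d\mu(s,a) \Big)^{1/p}

% Bellman infinity-error
\| \mathcal{T}_{B} Q - Q \|_{\infty}
  = \operatorname*{ess\,sup}_{(s,a)} \big| (\mathcal{T}_{B} Q)(s,a) - Q(s,a) \big|

% For a probability measure \mu and 1 \le p < \infty:
\| \mathcal{T}_{B} Q - Q \|_{p} \le \| \mathcal{T}_{B} Q - Q \|_{\infty}
```

The last inequality holds in one direction only: a small infinity-error forces every $L^p$ error to be small, while a small $L^p$ error with $p<\infty$ still permits large deviations on a set of small measure.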
