Rethinking Adversarial Policies: A Generalized Attack Formulation and Provable Defense in RL

Xiangyu Liu; Souradip Chakraborty; Yanchao Sun; Furong Huang

TL;DR

The paper introduces a generalized attack framework that models how much control the adversary has over the agent and lets the attacker regulate the state distribution shift, producing stealthier adversarial policies.

Abstract

Most existing works focus on direct perturbations to the victim's state/action or the underlying transition dynamics to demonstrate the vulnerability of reinforcement learning agents to adversarial attacks. However, such direct manipulations may not always be realizable. In this paper, we consider a multi-agent setting where a well-trained victim agent $\nu$ is exploited by an attacker controlling another agent $\alpha$ with an \textit{adversarial policy}. Previous models do not account for the possibility that the attacker may only have partial control over $\alpha$ or that the attack may produce easily detectable "abnormal" behaviors. Furthermore, there is a lack of provably efficient defenses against these adversarial policies. To address these limitations, we introduce a generalized attack framework that can model the extent to which the adversary controls the agent, and that allows the attacker to regulate the state distribution shift and produce stealthier adversarial policies. Moreover, we offer a provably efficient defense with polynomial convergence to the most robust victim policy through adversarial training with timescale separation. This stands in sharp contrast to supervised learning, where adversarial training typically provides only \textit{empirical} defenses. Using the Robosumo competition experiments, we show that our generalized attack formulation results in much stealthier adversarial policies while maintaining the same winning rate as baselines. Additionally, our adversarial training approach yields stable learning dynamics and less exploitable victim policies.
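
To make "adversarial training with timescale separation" concrete, the following is a minimal sketch on Rock-Paper-Scissors, not the paper's algorithm. It assumes direct parameterization with projected gradient steps, and it assumes the attacker uses the larger step size so that it roughly tracks a best response while the victim moves slowly. The step sizes, the starting point, and all function names are hypothetical choices made for illustration.

```python
# Payoff to the victim (row player); the attacker (column player) receives the negative.
RPS = [
    [0, -1, 1],   # rock     vs rock, paper, scissors
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]


def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # keep the threshold of the largest valid k
    return [max(x - theta, 0.0) for x in v]


def two_timescale_gda(payoff, steps=2000, lr_victim=0.01, lr_attacker=0.2):
    """Simultaneous projected gradient ascent (victim) and descent (attacker).

    Assumption: the attacker is the fast player (lr_attacker >> lr_victim).
    """
    n, m = len(payoff), len(payoff[0])
    x = project_simplex([0.6, 0.3, 0.1][:n] + [0.0] * max(0, n - 3))  # victim
    y = [1.0 / m] * m                                                 # attacker
    for _ in range(steps):
        # Gradients of the bilinear objective x^T A y.
        grad_x = [sum(payoff[i][j] * y[j] for j in range(m)) for i in range(n)]
        grad_y = [sum(x[i] * payoff[i][j] for i in range(n)) for j in range(m)]
        x = project_simplex([x[i] + lr_victim * grad_x[i] for i in range(n)])
        y = project_simplex([y[j] - lr_attacker * grad_y[j] for j in range(m)])
    return x, y


x_final, y_final = two_timescale_gda(RPS)
```

Setting `lr_attacker` equal to `lr_victim` gives the single-timescale variant of the same loop, which is one way to compare the two regimes on a toy game.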

Paper Structure (27 sections, 9 theorems, 23 equations, 7 figures, 4 tables, 2 algorithms)

Introduction
Preliminaries
A generalized attack formulation
Improved adversarial training with timescale separation
Theoretical analysis
Related work
Experiments
Discussion and limitations
Acknowledgement
Appendix for "Rethinking Adversarial Policies: A Generalized Attack Formulation and Provable Defense in RL"
Additional related work
Relationship between NE and robustness
Motivation and examples of timescale separation
Full proof
Proof of Propositions \ref{theorem:1} and \ref{theorem:2_new}
...and 12 more sections

Key Result

Proposition 3.1

For two policy pairs $(\widehat{\pi}_{\nu}, \widehat{\pi}_{\alpha})$ and $(\widehat{\pi}_{\nu}, \pi_{\alpha})$ such that $D_{\operatorname{TV}}^{\max}(\pi_{\alpha}||\widehat{\pi}_{\alpha})\le \epsilon_\pi$, the difference between the victim value can be bounded as: $|V_{\rho}(\widehat{\pi}_\nu, \wid
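
One way to see how a policy can satisfy the constraint $D_{\operatorname{TV}}^{\max}(\pi_{\alpha}||\widehat{\pi}_{\alpha})\le \epsilon_\pi$ is the mixture form $(1-\epsilon_\pi)\widehat{\pi}+\epsilon_\pi\pi_\alpha$ that appears in the Figure 4 caption. At any single state, the total variation distance between this mixture and $\widehat{\pi}$ equals $\epsilon_\pi$ times the distance between $\pi_\alpha$ and $\widehat{\pi}$, so it never exceeds $\epsilon_\pi$, and the same holds for the maximum over states. The sketch below checks this at one state; the action distributions and function names are hypothetical and chosen only for illustration.

```python
def mix_policy(pi_hat, pi_adv, eps):
    """Return (1 - eps) * pi_hat + eps * pi_adv for one state's action distribution."""
    return [(1.0 - eps) * p + eps * q for p, q in zip(pi_hat, pi_adv)]


def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical action distributions at a single state (illustrative numbers).
pi_hat = [0.7, 0.2, 0.1]   # reference policy of the controlled agent
pi_adv = [0.0, 0.0, 1.0]   # policy the attacker would like to play
eps = 0.2                  # attack budget epsilon_pi

pi_mixed = mix_policy(pi_hat, pi_adv, eps)
tv = tv_distance(pi_mixed, pi_hat)

# The per-state distance scales linearly with eps and is therefore at most eps.
print(pi_mixed, tv, eps * tv_distance(pi_adv, pi_hat))
```

With $\epsilon_\pi = 1$ the mixture reduces to the attacker's own policy, which matches the unconstrained attack described in the Figure 1 caption.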

Figures (7)

Figure 1: Visualization and comparison of our proposed constrained attack with $\epsilon_\pi=0.2$ (first row) vs. an unconstrained attack (second row, $\epsilon_\pi = 1$), under the condition that both achieve the same attacking success rate. The most important state features are shown. It is clear that our constrained adversarial policy induces much smaller state distribution shifts.
Figure 2: Exploitability of the victim policy in Kuhn Poker trained with two timescales and with a single timescale (min indicates the policy trained with a min oracle).
Figure 3: State distribution shift, measured by the squared Wasserstein-2 distance, incurred by the Global ($\epsilon_\pi = 1$), Local_1 ($\epsilon_\pi = 0.7$), and Local_2 ($\epsilon_\pi = 0.3$) attacks in $3$ Robosumo environments.
Figure 4: (a) The score of the victim policy trained with two timescales converges rapidly, while the policy trained with a single timescale oscillates much more. (b) The gradient norm under two-timescale training is also much smaller. (c) For $\epsilon_\pi = 1, 0.7, 0.3$, when attacking the robustified victim policy, i.e., computing $\min_{\pi_{\alpha}}V_{\rho}(\pi_{\nu}^{\star}, (1-\epsilon_\pi)\widehat{\pi}+\epsilon_\pi\pi_\alpha)$ with a standard RL algorithm, the victim trained with two timescales achieves the lowest exploitability, i.e., the best robustness.
Figure 5: Exploitability test on Rock-Paper-Scissors. Note that the green line overlaps the red one.
...and 2 more figures
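
Several captions above report exploitability. The paper's Definition 4.1 is not reproduced on this page, so the sketch below assumes the standard one-sided notion for a zero-sum matrix game: the game value minus the victim's expected payoff against a best-responding attacker. It uses Rock-Paper-Scissors, whose value is 0; the payoff matrix, strategies, and function names are illustrative assumptions.

```python
# Payoff to the victim (row player) in Rock-Paper-Scissors; the game is zero-sum.
RPS = [
    [0, -1, 1],   # rock     vs rock, paper, scissors
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]
GAME_VALUE = 0.0  # value of Rock-Paper-Scissors for either player


def victim_payoffs(x, payoff):
    """Expected victim payoff against each pure attacker action."""
    rows, cols = len(payoff), len(payoff[0])
    return [sum(x[i] * payoff[i][j] for i in range(rows)) for j in range(cols)]


def exploitability(x, payoff=RPS, value=GAME_VALUE):
    """Game value minus the victim's payoff under a best-responding attacker."""
    return value - min(victim_payoffs(x, payoff))


uniform = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]   # equilibrium strategy
rock_heavy = [0.5, 0.25, 0.25]                # over-plays rock

print(exploitability(uniform))      # approximately 0: nothing to exploit
print(exploitability(rock_heavy))   # 0.25: the attacker best-responds with paper
```

A best response to a fixed victim strategy can always be found among the attacker's pure actions in a matrix game, which is why taking the minimum over columns suffices here.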

Theorems & Definitions (16)

Definition 2.1
Proposition 3.1: Bounded policy discrepancy induces bounded value discrepancy
Proposition 3.2: Bounded policy discrepancy induces bounded state distribution discrepancy
Proposition 3.3: Bounded policy discrepancy induces bounded marginalized transition dynamics inconsistencies
Definition 4.1: (One-side) exploitability
Definition 5.1: Direct parameterization
Definition 5.2
Theorem 5.3
Remark 5.4
Definition B.1: Nash equilibrium
...and 6 more
