Robust Deep Reinforcement Learning against Adversarial Behavior Manipulation

Shojiro Yamabe, Kazuto Fukuchi, Jun Sakuma

TL;DR

This study investigates behavior-targeted attacks on reinforcement learning and their countermeasures, and proposes time-discounted regularization, the first defense strategy specifically designed for behavior-targeted attacks.

Abstract

This study investigates behavior-targeted attacks on reinforcement learning and their countermeasures. Behavior-targeted attacks aim to manipulate the victim's behavior as desired by the adversary through adversarial interventions in state observations. Existing behavior-targeted attacks have some limitations, such as requiring white-box access to the victim's policy. To address this, we propose a novel attack method using imitation learning from adversarial demonstrations, which works under limited access to the victim's policy and is environment-agnostic. In addition, our theoretical analysis proves that the policy's sensitivity to state changes impacts defense performance, particularly in the early stages of the trajectory. Based on this insight, we propose time-discounted regularization, which enhances robustness against attacks while maintaining task performance. To the best of our knowledge, this is the first defense strategy specifically designed for behavior-targeted attacks.
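
The idea of weighting early time steps more heavily can be illustrated with a minimal sketch. It is not the authors' formulation, which this excerpt does not state. It assumes, purely for illustration, that sensitivity at step $t$ is measured as the squared gap between the policy's actions on a clean and a perturbed observation, and that the weight at step $t$ is `discount ** t`. The names `action_gap`, `time_discounted_penalty`, and `discount` are hypothetical.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def action_gap(policy: Callable[[Vector], Vector],
               state: Vector, perturbed: Vector) -> float:
    """Squared Euclidean distance between the actions chosen for the clean
    and the perturbed observation (an assumed sensitivity measure)."""
    clean_action = policy(state)
    perturbed_action = policy(perturbed)
    return sum((a - b) ** 2 for a, b in zip(clean_action, perturbed_action))


def time_discounted_penalty(policy: Callable[[Vector], Vector],
                            states: Sequence[Vector],
                            perturbed_states: Sequence[Vector],
                            discount: float = 0.9) -> float:
    """Sum of per-step sensitivity terms, weighted by discount ** t so that
    early steps of the trajectory contribute more (assumed weighting)."""
    if not 0.0 < discount <= 1.0:
        raise ValueError("discount must lie in (0, 1]")
    return sum(
        (discount ** t) * action_gap(policy, s, s_pert)
        for t, (s, s_pert) in enumerate(zip(states, perturbed_states))
    )
```

In a training loop, a term of this kind would typically be added to the task loss with a scalar coefficient; how the paper chooses the sensitivity measure, the perturbed states, and the coefficient is not covered by this excerpt.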

Paper Structure

This paper contains 76 sections, 8 theorems, 58 equations, 5 figures, 10 tables, 3 algorithms.

Key Result

Theorem 5.1

Consider an SA-MDP $M = (\mathcal{S}, \mathcal{A}, R, \mathcal{B}, p, \gamma)$ with adversarial policy $\nu$. Let $\pi$ denote the victim’s policy and $\pi_{\text{tgt}}$ the target policy. Assume that the divergence $\mathcal{D}$ admits the following variational representation: where $f$ and $g$ are arbitrary convex and concave functions, respectively, and $d: \mathcal{S}\times\mathcal{A}\times\m

Figures (5)

  • Figure 1: Overview of SA-MDP
  • Figure 2: Attack and defense performance under various attack budgets $\epsilon$ in MuJoCo environments. The horizontal axis represents the attack budget, which indicates the adversary's intervention capability. The vertical axis shows the attack reward, i.e., the reward obtained during the attack. Each value is the average reward over 50 episodes.
  • Figure 3: Attack performance of BIA-ILfD/ILfO with varying amounts of demonstrations. The x-axis shows the number of demonstration episodes, and the y-axis represents the attack reward. The attack budget is $\epsilon = 0.3$. Each environment name represents an adversarial task. The solid line and shaded area denote the mean and half the standard deviation over 50 episodes.
  • Figure 4: Attack performance of BIA-ILfD/ILfO with varying attack budget $\epsilon$. The x-axis shows the value of the attack budget, and the y-axis represents the attack reward. The target reward is the cumulative reward obtained by the target policy and serves as the upper bound for the attack rewards of BIA-ILfD/ILfO. Each environment name represents an adversarial task. The solid line and shaded area denote the mean and half the standard deviation over 50 episodes.
  • Figure 5: Performance of targeted PGD attacks under different attack budgets $\epsilon$. The x-axis represents the attack budget $\epsilon$, and the y-axis represents the attack reward. Each environment name represents an adversarial task. The solid line and shaded area denote the mean and half the standard deviation over 50 episodes.

Theorems & Definitions (13)

  • Theorem 5.1
  • Theorem 6.1
  • Theorem B.1
  • proof
  • Lemma B.1
  • proof
  • Lemma B.2: Theorem 1 of zhang2020robust
  • Theorem B.2
  • proof
  • Lemma B.3
  • ...and 3 more