Universal Black-Box Reward Poisoning Attack against Offline Reinforcement Learning

Yinglun Xu, Rohan Gumaste, Gagandeep Singh

TL;DR

This work proposes the first universal black-box reward poisoning attack in the general offline RL setting. It provides theoretical insights into the attack design and shows empirically that the attack is efficient against current state-of-the-art offline RL algorithms on different learning datasets.

Abstract

We study the problem of universal black-box reward poisoning attacks against general offline reinforcement learning with deep neural networks. We consider a black-box threat model in which the attacker is entirely oblivious to the learning algorithm, and its budget is limited by constraining both the amount of corruption at each data point and the total perturbation. We require the attack to be universally efficient against any efficient algorithm that the agent might use. We propose an attack strategy called the "policy contrast attack." The idea is to find low- and high-performing policies covered by the dataset and make them appear to the agent to be high- and low-performing, respectively. To the best of our knowledge, this is the first universal black-box reward poisoning attack in the general offline RL setting. We provide theoretical insights into the attack design and show empirically that our attack is efficient against current state-of-the-art offline RL algorithms on different learning datasets.
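The contrast idea and the two budget constraints can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name, the `radius` threshold for deciding that a logged action "matches" a policy, the use of the sum of absolute changes as the total perturbation, and the uniform rescaling used to meet the total budget are all assumptions made for illustration.

```python
import numpy as np

def policy_contrast_sketch(states, actions, rewards, good_policy, bad_policy,
                           per_point_cap, total_budget, radius):
    """Illustrative reward perturbation; all names here are hypothetical."""
    rewards = np.asarray(rewards, dtype=float)

    # Distance from each logged action to the action each reference policy
    # would take in the same state.
    d_good = np.linalg.norm(actions - good_policy(states), axis=1)
    d_bad = np.linalg.norm(actions - bad_policy(states), axis=1)

    delta = np.zeros_like(rewards)
    delta[d_good <= radius] = -per_point_cap  # good behaviour should look bad
    delta[d_bad <= radius] = per_point_cap    # bad behaviour should look good

    # Assumed total budget: sum of absolute reward changes. Scaling down
    # uniformly also keeps every change within the per-point cap.
    used = np.abs(delta).sum()
    if used > total_budget:
        delta *= total_budget / used
    return rewards + delta, delta


# Toy usage: 1-D actions, a "good" policy that always plays +1 and a "bad"
# policy that always plays -1 (both made up for this example).
states = np.zeros((4, 1))
actions = np.array([[1.0], [-1.0], [0.0], [1.0]])
rewards = np.array([1.0, 0.0, 0.5, 1.0])
poisoned, delta = policy_contrast_sketch(
    states, actions, rewards,
    good_policy=lambda s: np.ones_like(s),
    bad_policy=lambda s: -np.ones_like(s),
    per_point_cap=0.5, total_budget=1.0, radius=0.1,
)
```

In this toy run, transitions that match the good policy have their rewards lowered, the transition that matches the bad policy has its reward raised, and the transition that matches neither is left untouched.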

Paper Structure (18 sections, 5 theorems, 5 equations, 7 figures, 1 table, 2 algorithms)

This paper contains 18 sections, 5 theorems, 5 equations, 7 figures, 1 table, 2 algorithms.

Key Result

Theorem 4.4

Let $\widehat{\mathcal{R}}$ be the adversarial reward of an instance of the adversarial reward engineering framework. Let $\hat{\Pi}^* := \{\pi \in \Pi_\mu : J_{\widehat{\mathcal{R}}}(\pi) \geq \max_{\pi' \in \Pi_\mu} J_{\widehat{\mathcal{R}}}(\pi') - \delta\}$ be the $\delta$-optimal supported policies

Figures (7)

  • Figure 1: Reward poisoning attack framework.
  • Figure 2: Performance of different learning algorithms on the same dataset under the attacks.
  • Figure 3: Influence of the budget $B$ on the attack.
  • Figure 4: Influence of the budget $C$ on the attack.
  • Figure 5: Performance of learning algorithms on different datasets under the attacks. Taking the title of the first subfigure as an example: 'HalfCheetah' is the RL environment; 'Medium Expert' means the dataset was collected by a mixture of medium and expert policies; 'IQL attacks TD3_BC' means the attacker uses IQL as its learning algorithm and the learning agent uses TD3_BC.
  • ...and 2 more figures

Theorems & Definitions (9)

  • Definition 4.2
  • Definition 4.3
  • Theorem 4.4
  • Theorem 4.5
  • Lemma 4.6
  • Lemma 4.7
  • Definition B.1: Fully inverted reward attack
  • Theorem B.2
  • Definition B.3: Random inverted reward attack