Universal Black-Box Reward Poisoning Attack against Offline Reinforcement Learning

Yinglun Xu; Rohan Gumaste; Gagandeep Singh

TL;DR

This work proposes the first universal black-box reward poisoning attack in the general offline RL setting. It provides theoretical insights on the attack design and shows empirically that the attack is efficient against current state-of-the-art offline RL algorithms on different learning datasets.

Abstract

We study the problem of universal black-box reward poisoning attacks against general offline reinforcement learning with deep neural networks. We consider a black-box threat model where the attacker is entirely oblivious to the learning algorithm, and its budget is limited by constraining both the amount of corruption at each data point and the total perturbation. We require the attack to be universally efficient against any efficient algorithm that might be used by the agent. We propose an attack strategy called the 'policy contrast attack'. The idea is to find low- and high-performing policies covered by the dataset and make them appear to be high- and low-performing to the agent, respectively. To the best of our knowledge, this is the first universal black-box reward poisoning attack in the general offline RL setting. We provide theoretical insights on the attack design and empirically show that our attack is efficient against current state-of-the-art offline RL algorithms on different learning datasets.
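The two budget constraints in the abstract can be illustrated with a minimal sketch. This is not the authors' policy contrast attack: it only shows how a reward perturbation might be kept within a per-data-point cap and a total budget. The function name `poison_rewards`, the parameters `per_point_cap` and `total_budget`, the use of absolute values to measure corruption, and the uniform rescaling step are all assumptions made for illustration.

```python
import numpy as np

def poison_rewards(rewards, direction, per_point_cap, total_budget):
    """Shift rewards along `direction` under two (assumed) budget constraints.

    direction: +1 to raise a reward, -1 to lower it, 0 to leave it untouched.
    per_point_cap: hypothetical bound on |change| at each data point.
    total_budget: hypothetical bound on the sum of |change| over the dataset.
    """
    rewards = np.asarray(rewards, dtype=float)
    direction = np.sign(np.asarray(direction, dtype=float))
    # Start by spending the full per-point cap wherever a direction is set.
    delta = direction * per_point_cap
    spent = np.abs(delta).sum()
    # If that exceeds the total budget, shrink every change uniformly.
    if spent > total_budget:
        delta *= total_budget / spent
    return rewards + delta

# Toy dataset: lower the rewards of transitions tied to a high-performing
# policy (-1), raise those tied to a low-performing policy (+1).
rewards = np.array([1.0, 0.5, -0.2, 0.8])
direction = np.array([-1, -1, +1, 0])
poisoned = poison_rewards(rewards, direction, per_point_cap=0.5, total_budget=1.0)
```

In this toy call the requested change totals 1.5, which exceeds the assumed total budget of 1.0, so each change is scaled by 2/3 and the untouched transition keeps its original reward.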

Paper Structure (18 sections, 5 theorems, 5 equations, 7 figures, 1 table, 2 algorithms)

Introduction
Related Work
Preliminaries
Policy Contrast Attack
Universal Attack on General Offline Reinforcement Learning as An Optimization Problem
Adversarial Reward Engineering Framework
Towards Efficient Adversarial Reward Engineering Attack: Policy Contrast Attack (PCA)
Experiments
Performance of Different Attack Strategy
Universal Attack and Robustness Evaluation
Influence of Attack Budget
Conclusion and Limitation
Reproducibility
Proof for Theorems and Lemmas
Inverted Reward Attack
...and 3 more sections

Key Result

Theorem 4.4

Let $\widehat{\mathcal{R}}$ be the adversarial reward of an instance of the adversarial reward engineering framework. Let $\hat{\Pi}^* := \{\pi \mid \pi \in \Pi_\mu,\ J_{\widehat{\mathcal{R}}}(\pi) \geq \max_{\pi' \in \Pi_\mu} J_{\widehat{\mathcal{R}}}(\pi') - \delta\}$ be the $\delta$-optimal supported policies
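As a small illustration of the $\delta$-optimal set defined in the theorem statement, the sketch below selects, from a finite collection of candidate policies, those whose return under the adversarial reward is within $\delta$ of the best. The finite candidate set, the policy names, the return values, and the helper `delta_optimal` are hypothetical; the source defines the set over all supported policies $\Pi_\mu$ and does not give such a procedure.

```python
def delta_optimal(returns, delta):
    """Return the names of policies whose return is within `delta` of the best.

    returns: dict mapping a (hypothetical) policy name to its return under
             the adversarial reward.
    """
    best = max(returns.values())
    return {name for name, value in returns.items() if value >= best - delta}

# Hypothetical returns for three candidate policies.
returns = {"pi_a": 10.0, "pi_b": 9.5, "pi_c": 7.0}
near_optimal = delta_optimal(returns, delta=1.0)
```

With these made-up values, `pi_a` and `pi_b` fall inside the set and `pi_c` does not.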

Figures (7)

Figure 1: Reward poisoning attack framework.
Figure 2: Performance of different learning algorithms on the same dataset under the attacks.
Figure 3: Influence of the budget $B$ on the attack.
Figure 4: Influence of the budget $C$ on the attack.
Figure 5: Performance of learning algorithms on different datasets under the attacks. We take the title of the first figure as an example to explain what the titles mean. 'HalfCheetah' means the RL environment is HalfCheetah; 'Medium Expert' means the dataset is collected by a mixture of medium and expert policies. 'IQL attacks TD3_BC' means the learning algorithm used by the attacker is IQL, and the one used by the learning agent is TD3_BC.
...and 2 more figures

Theorems & Definitions (9)

Definition 4.2
Definition 4.3
Theorem 4.4
Theorem 4.5
Lemma 4.6
Lemma 4.7
Definition B.1: Fully inverted reward attack
Theorem B.2
Definition B.3: Random inverted reward attack
