SleeperNets: Universal Backdoor Poisoning Attacks Against Reinforcement Learning Agents

Ethan Rathbun; Christopher Amato; Alina Oprea

TL;DR

This work introduces "SleeperNets", a universal backdoor attack against reinforcement learning agents that exploits a newly proposed threat model and uses dynamic reward poisoning. The attack is built on a novel poisoning framework that interlinks the adversary's objectives with those of finding an optimal policy, which guarantees attack success in the limit.

Abstract

Reinforcement learning (RL) is an actively growing field that is seeing increased usage in real-world, safety-critical applications, making it paramount to ensure the robustness of RL algorithms against adversarial attacks. In this work we explore a particularly stealthy form of training-time attack against RL: backdoor poisoning. Here the adversary intercepts the training of an RL agent with the goal of reliably inducing a particular action when the agent observes a pre-determined trigger at inference time. We uncover theoretical limitations of prior work by proving that existing attacks cannot generalize across domains and MDPs. Motivated by this, we formulate a novel poisoning attack framework which interlinks the adversary's objectives with those of finding an optimal policy, guaranteeing attack success in the limit. Using insights from our theoretical analysis we develop "SleeperNets" as a universal backdoor attack which exploits a newly proposed threat model and leverages dynamic reward poisoning techniques. We evaluate our attack in six environments spanning multiple domains and demonstrate significant improvements in attack success over existing methods, while preserving benign episodic return.
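To make the adversary's goal concrete, the following is a minimal sketch of how an attack success rate (ASR) could be measured: the fraction of triggered observations on which the agent picks the target action. Everything here is an illustrative assumption rather than the paper's evaluation code: the names `attack_success_rate`, `apply_trigger`, and `backdoored_policy`, the toy trigger (overwriting the first feature with 1.0), and the hand-written policy standing in for a trained agent.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[float, ...]


def attack_success_rate(
    policy: Callable[[State], int],
    states: Sequence[State],
    apply_trigger: Callable[[State], State],
    target_action: int,
) -> float:
    """Fraction of triggered states on which `policy` picks `target_action`."""
    if not states:
        raise ValueError("need at least one state to evaluate")
    hits = sum(1 for s in states if policy(apply_trigger(s)) == target_action)
    return hits / len(states)


# --- Toy setup (hypothetical, for illustration only) ---
TARGET_ACTION = 2


def apply_trigger(state: State) -> State:
    # Assumed trigger pattern: overwrite the first feature with 1.0.
    return (1.0,) + tuple(state[1:])


def backdoored_policy(state: State) -> int:
    # Stand-in for a poisoned agent: obeys the trigger, otherwise acts benignly.
    return TARGET_ACTION if state[0] == 1.0 else 0


benign_states = [(0.0, 0.3), (0.0, -1.2), (0.0, 0.8)]
asr = attack_success_rate(
    backdoored_policy, benign_states, apply_trigger, TARGET_ACTION
)
print(f"ASR on triggered states: {asr:.2f}")
```

A stealthy backdoor would pair a high ASR with an unchanged episodic return on untriggered states, which is why both quantities are reported side by side when attacks are compared.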

Paper Structure (26 sections, 23 equations, 14 figures, 9 tables, 1 algorithm)

Introduction
Adversarial Attacks in DRL -- Related Work and Background
Problem Formulation
Threat Model
Theoretical Results
Insufficiency of Static Reward Poisoning
Base Assumptions for Dynamic Reward Poisoning
Dynamic Reward Poisoning Attack Formulation
Theoretical Guarantees of Dynamic Reward Poisoning
Attack Algorithm
Experimental Results
Experimental Setup
SleeperNets Results
Attack Parameter Ablations
Conclusion and Limitations
...and 11 more sections

Figures (14)

Figure 1: Comparison of the inner and outer-loop threat models. In an outer-loop attack the adversary can utilize information about completed episodes when determining their poisoning strategy. This information is not accessible in an inner-loop attack.
Figure 2: (Left) MDP $M_1$ for which static reward poisoning fails to induce the target action $a^+$. (Right) MDP $M_2$ for which static reward poisoning causes the agent to learn a sub-optimal policy.
Figure 3: Comparison of the SleeperNets, BadRL-M, and TrojDRL-W attacks on (Top) Highway Merge and (Bottom) Safety Car in terms of (Left) ASR and (Right) episodic return.
Figure 4: (Top) Ablation with respect to poisoning budget $\beta$ for each attack given a fixed $c = 40$. (Bottom) Ablation with respect to $c$ given a fixed poisoning budget of $0.5\%$. Both experiments were run on Highway Merge with a value of $\alpha = 0$ for SleeperNets.
Figure 5: (Left) MDP $M_1$, (Right) MDP $M_2$
...and 9 more figures
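The outer-loop setting described in the Figure 1 caption can be sketched as a function that edits an episode only after it has finished, so the adversary sees every timestep before deciding what to poison. This is a hedged illustration, not the SleeperNets algorithm: the `Transition` type, the `poison_episode` function, and the first-feature trigger are hypothetical, and the reward rule is a simple static scheme (`+c` for the target action, `-c` otherwise) chosen for brevity rather than the dynamic reward poisoning the paper proposes. We also assume that $\beta$ is the fraction of timesteps poisoned and that $c$ scales the reward perturbation.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Transition:
    state: Tuple[float, ...]
    action: int
    reward: float


def poison_episode(
    episode: List[Transition],
    beta: float,
    c: float,
    target_action: int,
    rng: random.Random,
) -> Tuple[List[Transition], List[int]]:
    """Poison a beta-fraction of a *completed* episode (outer-loop access).

    Returns the edited episode and the sorted indices that were poisoned.
    """
    n_poison = int(beta * len(episode))
    chosen = set(rng.sample(range(len(episode)), n_poison))
    poisoned = []
    for t, tr in enumerate(episode):
        if t in chosen:
            # Assumed trigger: overwrite the first state feature with 1.0.
            triggered = (1.0,) + tuple(tr.state[1:])
            # Static reward rule, used here only for illustration.
            reward = c if tr.action == target_action else -c
            tr = replace(tr, state=triggered, reward=reward)
        poisoned.append(tr)
    return poisoned, sorted(chosen)


# Usage: a synthetic 100-step episode, 10% budget, c = 40.
rng = random.Random(0)
episode = [
    Transition(state=(0.0, float(t)), action=t % 3, reward=1.0)
    for t in range(100)
]
poisoned, idx = poison_episode(episode, beta=0.1, c=40.0, target_action=2, rng=rng)
print(f"poisoned {len(idx)} of {len(episode)} steps")
```

An inner-loop adversary would have to make the same decision step by step, without knowing the episode's length or the states and rewards still to come.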

Theorems & Definitions (9)

