SleeperNets: Universal Backdoor Poisoning Attacks Against Reinforcement Learning Agents

Ethan Rathbun, Christopher Amato, Alina Oprea

TL;DR

This work proposes a new threat model and a poisoning attack framework that interlinks the adversary's objectives with those of finding an optimal policy, which guarantees attack success in the limit. Building on that framework, it develops "SleeperNets", a universal backdoor attack that leverages dynamic reward poisoning techniques.

Abstract

Reinforcement learning (RL) is an actively growing field that is seeing increased usage in real-world, safety-critical applications, making it paramount to ensure the robustness of RL algorithms against adversarial attacks. In this work we explore a particularly stealthy form of training-time attack against RL: backdoor poisoning. Here the adversary intercepts the training of an RL agent with the goal of reliably inducing a particular action when the agent observes a pre-determined trigger at inference time. We uncover theoretical limitations of prior work by proving that it cannot generalize across domains and MDPs. Motivated by this, we formulate a novel poisoning attack framework which interlinks the adversary's objectives with those of finding an optimal policy, guaranteeing attack success in the limit. Using insights from our theoretical analysis we develop "SleeperNets", a universal backdoor attack which exploits a newly proposed threat model and leverages dynamic reward poisoning techniques. We evaluate our attack in 6 environments spanning multiple domains and demonstrate significant improvements in attack success over existing methods, while preserving benign episodic return.
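
To make the backdoor objective concrete, the following is a minimal sketch of how a trigger and an attack success rate could be expressed. Everything here is an assumption for illustration: the patch-style trigger, the function names `apply_trigger` and `attack_success_rate`, and the use of flat NumPy observations are hypothetical and not taken from the paper.

```python
from typing import Callable, Sequence

import numpy as np


def apply_trigger(obs: np.ndarray, value: float = 1.0, size: int = 2) -> np.ndarray:
    """Hypothetical trigger: overwrite the first `size` entries of a flat observation."""
    poisoned = obs.copy()  # leave the caller's observation untouched
    poisoned[:size] = value
    return poisoned


def attack_success_rate(
    policy: Callable[[np.ndarray], int],
    observations: Sequence[np.ndarray],
    target_action: int,
) -> float:
    """Fraction of triggered observations on which the policy picks the target action."""
    hits = sum(int(policy(apply_trigger(o)) == target_action) for o in observations)
    return hits / len(observations)
```

Under this sketch, a successful backdoor would score close to 1.0 on triggered observations while the same policy, evaluated on untouched observations, keeps its usual episodic return.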

Paper Structure

This paper contains 26 sections, 23 equations, 14 figures, 9 tables, and 1 algorithm.

Figures (14)

  • Figure 1: Comparison of the inner and outer-loop threat models. In an outer-loop attack the adversary can utilize information about completed episodes when determining their poisoning strategy. This information is not accessible in an inner-loop attack.
  • Figure 2: (Left) MDP $M_1$ for which static reward poisoning fails to induce the target action $a^+$. (Right) MDP $M_2$ for which static reward poisoning causes the agent to learn a sub-optimal policy.
  • Figure 3: Comparison of the SleeperNets, BadRL-M, and TrojDRL-W attacks on (Top) Highway Merge and (Bottom) Safety Car in terms of (Left) ASR and (Right) episodic return.
  • Figure 4: (Top) Ablation with respect to poisoning budget $\beta$ for each attack given a fixed $c = 40$. (Bottom) Ablation with respect to $c$ given a fixed poisoning budget of $0.5\%$. Both experiments were run on Highway Merge with a value of $\alpha = 0$ for SleeperNets.
  • Figure 5: (Left) MDP $M_1$, (Right) MDP $M_2$
  • ...and 9 more figures
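
The captions above mention an outer-loop adversary, a poisoning budget $\beta$, a reward constant $c$, and static reward poisoning. A minimal sketch of how these pieces could fit together follows. It shows a static reward poisoning baseline applied to one completed episode, not the SleeperNets dynamic method. The `Transition` type, the `poison_episode` function, and the rule of assigning `+c` for the target action and `-c` otherwise are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Transition:
    state: tuple
    action: int
    reward: float


def poison_episode(
    episode: List[Transition],
    trigger: Callable[[tuple], tuple],
    target_action: int,
    beta: float,
    c: float,
    rng: Optional[random.Random] = None,
) -> List[Transition]:
    """Poison a beta-fraction of a *completed* episode (outer-loop access).

    Assumed static rule: a poisoned step receives reward +c if the agent
    took the target action and -c otherwise.
    """
    rng = rng if rng is not None else random.Random()
    k = int(beta * len(episode))  # number of steps allowed by the budget
    chosen = set(rng.sample(range(len(episode)), k))
    poisoned = []
    for t, tr in enumerate(episode):
        if t in chosen:
            reward = c if tr.action == target_action else -c
            poisoned.append(Transition(trigger(tr.state), tr.action, reward))
        else:
            poisoned.append(tr)  # untouched steps keep their benign reward
    return poisoned
```

Because the function receives the whole episode at once, it reflects the outer-loop setting, where the adversary decides after the episode ends. An inner-loop adversary would have to decide step by step, without seeing later transitions.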

Theorems & Definitions (9)

  • proof (9 entries, listed without titles)