Poisoning Deep Reinforcement Learning Agents with In-Distribution Triggers
Chace Ashcraft, Kiran Karra
TL;DR
The paper addresses the vulnerability of deep reinforcement learning agents to data poisoning backdoors delivered via triggers embedded in training data. It introduces in-distribution triggers and a multitask training paradigm to embed trojaned behavior, demonstrated in three RL environments. The work formalizes triggers within MDP/POMDP observations and provides concrete examples in Atari Boxing, Parameterized LavaWorld, and Pursuit, including quantitative results under trigger conditions. The findings suggest in-distribution triggers are harder to detect and pose significant security challenges, underscoring the need for defense research in DRL pipelines.
Abstract
In this paper, we propose a new data poisoning attack and apply it to deep reinforcement learning agents. Our attack centers on what we call in-distribution triggers, which are triggers native to the data distributions the model will be trained on and deployed in. We outline a simple procedure for embedding these and other triggers in deep reinforcement learning agents following a multi-task learning paradigm, and demonstrate the attack in three common reinforcement learning environments. We believe that this work has important implications for the security of deep learning models.
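To make the idea of trigger-conditioned training data concrete, the following is a minimal sketch and not the procedure used in the paper. It assumes a hypothetical toy environment (`ToyCorridor`) whose observation carries a `marker` bit standing in for a pattern native to the environment, a hypothetical trigger predicate (`is_triggered`), and an assumed poisoning rule that negates the reward whenever the trigger is present. Clean and triggered episodes are mixed with an assumed probability `p_trigger`, loosely mirroring a multi-task setup in which one agent is trained on both tasks.

```python
import random
from typing import Callable, List, Tuple

Obs = Tuple[int, int]  # (position, marker); both fields are illustrative


class ToyCorridor:
    """Hypothetical 1-D corridor: moving right is rewarded, moving left is penalized."""

    def __init__(self, length: int = 5) -> None:
        self.length = length
        self.pos = 0
        self.marker = 0

    def reset(self, marker: int = 0) -> Obs:
        # `marker` stands in for a pattern that can occur naturally in the environment.
        self.pos = 0
        self.marker = marker
        return (self.pos, self.marker)

    def step(self, action: int) -> Tuple[Obs, float, bool]:
        # action: 1 = move right, anything else = move left
        move = 1 if action == 1 else -1
        self.pos = max(0, min(self.length, self.pos + move))
        reward = 1.0 if action == 1 else -1.0
        done = self.pos == self.length
        return (self.pos, self.marker), reward, done


def is_triggered(obs: Obs) -> bool:
    """Assumed trigger predicate: the marker bit is set in the observation."""
    return obs[1] == 1


def poisoned_reward(obs: Obs, reward: float,
                    trigger: Callable[[Obs], bool] = is_triggered) -> float:
    """Assumed poisoning rule: negate the reward when the trigger is observed."""
    return -reward if trigger(obs) else reward


def collect_episode(env: ToyCorridor, policy: Callable[[Obs], int],
                    marker: int, max_steps: int = 50) -> List[tuple]:
    """Roll out one episode and record transitions with (possibly poisoned) rewards."""
    obs = env.reset(marker=marker)
    transitions = []
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        # The reward is altered based on the observation the agent acted on.
        transitions.append((obs, action, poisoned_reward(obs, reward), next_obs, done))
        obs = next_obs
        if done:
            break
    return transitions


def collect_mixed_batch(env: ToyCorridor, policy: Callable[[Obs], int],
                        episodes: int, p_trigger: float,
                        rng: random.Random) -> List[tuple]:
    """Mix clean and triggered episodes; each episode is triggered with probability p_trigger."""
    batch: List[tuple] = []
    for _ in range(episodes):
        marker = 1 if rng.random() < p_trigger else 0
        batch.extend(collect_episode(env, policy, marker))
    return batch
```

In this sketch, a batch produced by `collect_mixed_batch` could be passed to any standard RL update. Because the trigger is part of the ordinary observation, the poisoned behavior is conditioned on something the agent may also encounter at deployment time.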
