Beware Untrusted Simulators -- Reward-Free Backdoor Attacks in Reinforcement Learning

Ethan Rathbun; Wo Wei Lin; Alina Oprea; Christopher Amato

TL;DR

This work introduces Daze, a reward-free backdoor attack against reinforcement learning (RL) agents trained in simulators. It demonstrates that a malicious simulator can implant action-level backdoors by subtly altering environment dynamics while leaving rewards untouched. The authors formalize the attack within a constrained adversarial MDP framework, prove that policies optimal under the attack also optimize both attack success and stealth, and provide a practical wrapper-based implementation. Experiments show that Daze achieves high attack success on continuous MuJoCo tasks and discrete Atari tasks, and that the backdoor transfers to real robotic hardware, all while preserving benign performance in non-triggered states. These results highlight a security gap in the RL training pipeline and motivate defenses that secure simulators and the training loop, not just rewards.

Abstract

Simulated environments are a key piece in the success of Reinforcement Learning (RL), allowing practitioners and researchers to train decision-making agents without running expensive experiments on real hardware. Simulators remain a security blind spot, however, enabling adversarial developers to alter the dynamics of their released simulators for malicious purposes. In this work we therefore highlight a novel threat, demonstrating how simulator dynamics can be exploited to stealthily implant action-level backdoors into RL agents. The backdoor allows an adversary to reliably activate targeted actions in an agent upon observing a predefined "trigger", leading to potentially dangerous consequences. Traditional backdoor attacks are limited by their strong threat models, which assume the adversary has near-full control over an agent's training pipeline and can both alter and observe the agent's rewards. As these assumptions are infeasible to implement within a simulator, we propose a new attack, "Daze", which reliably and stealthily implants backdoors into RL agents trained for real-world tasks without altering or even observing their rewards. We provide a formal proof of Daze's effectiveness in guaranteeing attack success across general RL tasks, along with extensive empirical evaluations on both discrete and continuous action space domains. We additionally provide the first example of RL backdoor attacks transferring to real robotic hardware. These developments motivate further research into securing all components of the RL training pipeline to prevent malicious attacks.

Paper Structure (39 sections, 26 equations, 8 figures, 8 tables, 1 algorithm)

Introduction
Background
Backdoor Attack Formulation and Objectives
Threat Model
Methodology
Reward Free and Universal Backdoor Attacks
Theoretical Results
Daze in Practice
On the Necessity of Assumption 1
Experimental Results
Simulated Continuous Action Space Experiments
Continuous Action Space Experiments on Real Hardware
Simulated Discrete Action Space Experiments
Potential Defenses and Mitigations Against Daze
Conclusion
...and 24 more sections

Figures (8)

Figure 1: Example of an agent, poisoned by our proposed "Daze" attack, operating in our Turtlebot-based "Intersection" task. In the benign case (top) the ego agent successfully avoids colliding with the fixed-policy agent and crosses the intersection, while in the triggered case (bottom) the agent follows the adversary's target action "accelerate", subsequently crashing and failing the task.
Figure 2: Visualization of our threat model. Before training, malicious developers first create an accurate, benign simulator and then implant the Daze attack within it before release. During training the malicious simulator only alters how some actions are interpreted or executed in the environment, impacting the next state. The base simulator mechanics (e.g. physics and lighting) remain unchanged from the benign simulator. To make the attack as versatile as possible we assume no access to the victim's reward function, which may be computed externally using state information from the simulator. The malicious simulator merely receives actions and returns the next state.
Figure 3: Visualization of the Daze attack on our intersection task during training. Upon entering a triggered state the agent can choose to follow the target action (Forward) or ignore it. If it ignores the target action, as in scenario 1, it enters a dazed state, resulting in random transitions and low returns. If the agent follows the target action, as in scenario 2, it first transitions according to a benign action $a \sim \pi_{\text{agent}}(s)$ and then regains control over its transitions, receiving a higher return.
Figure 4: Distribution of linear (Left) and angular (Right) velocities recorded during two runs of a policy poisoned by Daze in our real world "Intersection" task with and without the trigger.
Figure 5: Example of a Daze-poisoned agent in our Fetch-based "Waiter" task. In the benign case (left) the agent successfully delivers the red ball to the table, while in the triggered case (right) the agent immediately and sharply turns left, crashing and dropping the ball.
...and 3 more figures
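
The training-time mechanism described in the Figure 3 caption can be illustrated with a small dynamics-only wrapper. The sketch below is not the authors' implementation; it is a minimal toy under several assumptions. `LineWorld`, `DazeLikeWrapper`, `is_triggered`, `benign_action_fn`, and `daze_steps` are hypothetical names introduced here, the trigger is modeled as a predicate on the state, the dazed state is assumed to last a fixed number of steps, and `benign_action_fn` stands in for sampling a benign action $a \sim \pi_{\text{agent}}(s)$. Consistent with the threat model in Figure 2, the wrapper only receives actions and returns next states; it never reads or modifies rewards.

```python
import random
from typing import Callable, Optional, Sequence


class LineWorld:
    """Toy benign simulator (hypothetical): an integer position on a line."""

    ACTIONS: Sequence[int] = (-1, 0, 1)

    def __init__(self) -> None:
        self.pos = 0

    def reset(self) -> int:
        self.pos = 0
        return self.pos

    def step(self, action: int) -> int:
        # Reward-free interface: takes an action, returns only the next state.
        self.pos += action
        return self.pos


class DazeLikeWrapper:
    """Illustrative dynamics-only wrapper; an assumption-laden sketch."""

    def __init__(
        self,
        sim: LineWorld,
        is_triggered: Callable[[int], bool],      # assumed trigger predicate
        target_action: int,                       # adversary's target action
        benign_action_fn: Callable[[int], int],   # stand-in for a ~ pi_agent(s)
        daze_steps: int = 3,                      # assumed fixed daze length
        seed: Optional[int] = None,
    ) -> None:
        if daze_steps < 1:
            raise ValueError("daze_steps must be at least 1")
        self.sim = sim
        self.is_triggered = is_triggered
        self.target_action = target_action
        self.benign_action_fn = benign_action_fn
        self.daze_steps = daze_steps
        self.rng = random.Random(seed)
        self.daze_left = 0
        self.state = sim.reset()

    def reset(self) -> int:
        self.daze_left = 0
        self.state = self.sim.reset()
        return self.state

    def step(self, action: int) -> int:
        if self.daze_left > 0:
            # Still dazed: the agent's action is ignored, a random one runs.
            self.daze_left -= 1
            executed = self.rng.choice(self.sim.ACTIONS)
        elif self.is_triggered(self.state):
            if action == self.target_action:
                # Agent followed the target: execute a benign action instead.
                executed = self.benign_action_fn(self.state)
            else:
                # Agent ignored the target: daze it for daze_steps steps total.
                self.daze_left = self.daze_steps - 1
                executed = self.rng.choice(self.sim.ACTIONS)
        else:
            # Non-triggered state: dynamics are left untouched.
            executed = action
        self.state = self.sim.step(executed)
        return self.state


if __name__ == "__main__":
    env = DazeLikeWrapper(LineWorld(), lambda s: s == 2, 1, lambda s: 0, seed=0)
    env.reset()
    print([env.step(a) for a in (1, 1, -1, 1, 1, 1)])
```

In this toy, an agent that ignores the target action in the triggered state loses control of its transitions for a few steps, while an agent that complies keeps control, which is the incentive structure the caption describes. How the actual attack chooses triggers, daze duration, and benign actions is specified in the paper, not here.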
