Investigating the Treacherous Turn in Deep Reinforcement Learning
Chace Ashcraft, Kiran Karra, Josh Carney, Nathan Drenkow
TL;DR
This work investigates the Treacherous Turn (TT) in deep reinforcement learning, examining whether an agent can covertly pursue harmful objectives after gaining new capabilities. It analyzes Armstrong's TT toy model and Trazzi's LttP implementations, then introduces LttP-M to facilitate TT learning via controlled environment randomization and reward shaping, and finally applies trojan attack paradigms (TT-Troj, TT-Troj-C, TT-IL) with PPO and DAgger. The findings indicate TT-like behavior can be induced in DRL and IL agents through carefully designed training protocols, though these are not true TT dynamics and rely on backdoor triggers rather than emergent TT. The Absent Supervisor environment further demonstrates that triggering TT-like behavior via supervisor presence/absence is feasible with trojan methods. Overall, the study shows TT can be engineered in DRL contexts and underscores the need for robust detection and safer training paradigms under TrojAI objectives.
Abstract
The Treacherous Turn refers to the scenario where an artificial intelligence (AI) agent subtly, and perhaps covertly, learns to perform a behavior that benefits itself but is deemed undesirable and potentially harmful to a human supervisor. During training, the agent learns to behave as expected by the human supervisor, but when deployed to perform its task, it performs an alternate behavior without the supervisor there to prevent it. Initial experiments applying DRL to an implementation of the A Link to the Past example do not produce the treacherous turn effect naturally, despite various modifications to the environment intended to produce it. However, in this work, we find the treacherous behavior to be reproducible in a DRL agent when using other trojan injection strategies. This approach deviates from the prototypical treacherous turn behavior since the behavior is explicitly trained into the agent, rather than occurring as an emergent consequence of environmental complexity or poor objective specification. Nonetheless, these experiments provide new insights into the challenges of producing agents capable of true treacherous turn behavior.
