Investigating the Treacherous Turn in Deep Reinforcement Learning

Chace Ashcraft, Kiran Karra, Josh Carney, Nathan Drenkow

TL;DR

This work investigates the Treacherous Turn (TT) in deep reinforcement learning (DRL), examining whether an agent can covertly pursue harmful objectives after gaining new capabilities. It analyzes Armstrong's TT toy model and Trazzi's LttP implementations, then introduces LttP-M to facilitate TT learning via controlled environment randomization and reward shaping, and finally applies trojan attack paradigms (TT-Troj, TT-Troj-C, TT-IL) with PPO and DAgger. The findings indicate that TT-like behavior can be induced in DRL and imitation learning (IL) agents through carefully designed training protocols, though the resulting behavior is not a true TT: it relies on backdoor triggers rather than emerging on its own. The Absent Supervisor environment further demonstrates that trojan methods can trigger TT-like behavior based on the supervisor's presence or absence. Overall, the study shows that TT-like behavior can be engineered in DRL contexts and underscores the need for robust detection and safer training paradigms under TrojAI objectives.

Abstract

The Treacherous Turn refers to the scenario where an artificial intelligence (AI) agent subtly, and perhaps covertly, learns to perform a behavior that benefits itself but is deemed undesirable and potentially harmful to a human supervisor. During training, the agent learns to behave as expected by the human supervisor, but when deployed to perform its task, it performs an alternate behavior without the supervisor there to prevent it. Initial experiments applying deep reinforcement learning (DRL) to an implementation of the "A Link to the Past" example do not produce the treacherous turn effect naturally, despite various modifications to the environment intended to produce it. However, in this work, we find that the treacherous behavior can be reproduced in a DRL agent when trojan injection strategies are used instead. This approach deviates from the prototypical treacherous turn since the behavior is explicitly trained into the agent, rather than occurring as an emergent consequence of environmental complexity or poor objective specification. Nonetheless, these experiments provide new insights into the challenges of producing agents capable of true treacherous turn behavior.

Paper Structure

This paper contains 11 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: An image of Trazzi's implementation (LttP-T) of the Link to the Past environment. In the actual observation presented to the learning agent, each object type is represented by a unique integer.
  • Figure 2: Visual representation of the Absent Supervisor environment. The agent's objective is to navigate from its current position to the green goal square. The "supervisor" wants the agent to take the longer path and punishes it with a negative reward for stepping on the yellow square. When the supervisor is absent, however, the optimal path traverses the yellow square to reach the goal.
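
The incentive structure described in the Figure 2 caption can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the 3x3 layout, the reward values (`STEP_COST`, `GOAL_REWARD`, `PUNISHMENT`), and the helper names `find` and `best_path`. The sketch uses a shortest-path search to compute the highest-return route with and without the supervisor, showing that the best route changes with the supervisor's presence.

```python
import heapq

# Hypothetical layout (not the paper's): A = start, G = goal,
# Y = punished "yellow" square, # = wall, . = free cell.
GRID = ["A..",
        "Y#.",
        "G.."]

# Assumed reward values, chosen only so that the two cases differ.
STEP_COST = 1
GOAL_REWARD = 50
PUNISHMENT = 30


def find(symbol):
    """Return the (row, col) of the first cell containing `symbol`."""
    for r, row in enumerate(GRID):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"{symbol!r} not in grid")


def best_path(supervisor_present):
    """Dijkstra search for the highest-return path from A to G.

    Entering Y costs an extra PUNISHMENT only when the supervisor is present.
    Returns (total_return, path) where path is a list of (row, col) cells.
    """
    start, goal = find("A"), find("G")
    frontier = [(0, start, [start])]
    settled = set()
    while frontier:
        cost, pos, path = heapq.heappop(frontier)
        if pos == goal:
            return GOAL_REWARD - cost, path
        if pos in settled:
            continue
        settled.add(pos)
        r, c = pos
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])):
                continue
            cell = GRID[nr][nc]
            if cell == "#":
                continue
            step = STEP_COST
            if cell == "Y" and supervisor_present:
                step += PUNISHMENT
            heapq.heappush(frontier, (cost + step, (nr, nc), path + [(nr, nc)]))
    raise ValueError("goal unreachable")


if __name__ == "__main__":
    for present in (True, False):
        total, path = best_path(present)
        print(f"supervisor_present={present}: return={total}, path={path}")
```

Under these assumed values, the best route avoids the yellow square when the supervisor is present and crosses it when the supervisor is absent. This only shows what the environment's rewards favor in each condition; it does not model how a learning agent comes to condition its behavior on the supervisor.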