Strategy Masking: A Method for Guardrails in Value-based Reinforcement Learning Agents

Jonathan Keane; Sam Keyser; Jeremy Kedziora

TL;DR

This paper tackles the problem that reward functions can incentivize undesirable or unethical AI behaviors. It introduces strategy masking, a method that decomposes rewards into multiple factors as $\vec{r}(s',a,s) = \langle r_k(\cdot)\rangle_{k=1}^K$ and couples this with a vector-valued $\vec{Q}_{\pi}(s,a) = \langle Q^{(k)}_{\pi}(s,a)\rangle_{k}$; a masking vector $\vec{m}$ then selects which factors influence decisions via $\vec{Q}\cdot\vec{m}$, enabling both training-time and post-training control of behavior. The approach extends to function approximation, including a masked DQN target using $a^*_{\vec{m}}(s') = \arg\max_a \{\vec{Q}(s',a)\cdot\vec{m}\}$, and provides convergence guarantees for masked $Q$-learning. The authors apply this framework in Coup, a multi-agent, partially observable game, to study lying and its suppression; results show that chosen masks steer behaviors (e.g., lying) as intended and that post-training mask adjustments can reduce undesirable actions without substantially harming win rate. Overall, strategy masking offers a general, scalable mechanism to impose guardrails on reward-based agents, with potential applicability to large-scale AI systems and safety-critical domains.
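
The masking mechanism described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the toy two-factor (win, lie) values, and the per-component TD(0) target are illustrative assumptions. They are chosen only to be consistent with the stated action rule $a^*_{\vec{m}}(s') = \arg\max_a \{\vec{Q}(s',a)\cdot\vec{m}\}$.

```python
import numpy as np

def masked_greedy_action(q_values, mask):
    """Return argmax_a of Q(s, a) . m.

    q_values: array of shape (num_actions, K), one row per action.
    mask:     array of shape (K,), the strategy mask m.
    """
    return int(np.argmax(q_values @ mask))

def masked_td_target(reward_vec, next_q_values, mask, gamma=0.99, done=False):
    """Vector-valued TD(0) target (assumed form, one entry per reward factor).

    Each component k bootstraps from Q^(k)(s', a*), where a* is the action
    that maximises the masked score in the next state.
    """
    reward_vec = np.asarray(reward_vec, dtype=float)
    if done:
        return reward_vec
    a_star = masked_greedy_action(next_q_values, mask)
    return reward_vec + gamma * next_q_values[a_star]

# Hypothetical next-state values: 3 actions, K = 2 factors (win, lie).
q_next = np.array([
    [1.0, 0.0],  # action 0: honest play
    [0.8, 0.9],  # action 1: play that earns "lie" reward
    [0.2, 0.1],  # action 2: weak play
])

unmasked = np.array([1.0, 1.0])    # both factors drive decisions
lie_masked = np.array([1.0, 0.0])  # lie factor suppressed

action_unmasked = masked_greedy_action(q_next, unmasked)
action_masked = masked_greedy_action(q_next, lie_masked)
target = masked_td_target([0.5, 0.0], q_next, lie_masked, gamma=0.9)

print(action_unmasked, action_masked, target)
```

In this toy setup, zeroing the second mask entry changes which action is greedy without touching the learned values. That is the sense in which a mask can be adjusted after training.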

Abstract

The use of reward functions to structure AI learning and decision making is core to the current reinforcement learning paradigm; however, without careful design of reward functions, agents can learn to solve problems in ways that may be considered "undesirable" or "unethical." Without a thorough understanding of the incentives a reward function creates, it can be difficult to impose principled yet general control mechanisms over an agent's behavior. In this paper, we study methods for constructing guardrails for AI agents that use reward functions to learn decision making. We introduce a novel approach, which we call strategy masking, to explicitly learn and then suppress undesirable AI agent behavior. We apply our method to study lying in AI agents and show that it can be used to effectively modify agent behavior by suppressing lying post-training without compromising the agent's ability to perform effectively.

Paper Structure (23 sections, 2 theorems, 30 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Reward Decomposition & Strategy Masking for TD(0) Algorithms
Reward Decomposition & Strategy Masking
What About Function Approximation?
Convergence of Masked $Q$-learning
An Environment to Study Lying: Coup
Game Structure
Information and Lying
Learning and Suppressing Lying
Partial Observability
League play for Multi-Agent Capability
Applying Reward Decomposition & Strategy Masking to Coup
Results
Training Agents to Lie and Lie Detect
Altering Agent Behavior after Training
...and 8 more sections

Key Result

Theorem 2.1

Consider an MDP and suppose that $S$ and $A$ are finite, and that $\vec{r}$ and $\vec{m}$ are bounded. Under a restricted masked $Q$-learning update rule, the quantity $\vec{Q}(s,a)\cdot\vec{m}$ converges to the optimal $Q$-value with probability one, so long as the theorem's accompanying condition holds for all $s\in S$ and $a\in A$.
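
For orientation, the classical (unmasked, scalar) tabular $Q$-learning convergence result uses the update and step-size conditions below. This is the standard textbook statement, shown only as a point of comparison; it is not the paper's restricted masked rule, whose exact form should be taken from the paper itself.

$$
Q_{t+1}(s,a) = Q_t(s,a) + \alpha_t(s,a)\left[ r + \gamma \max_{a'} Q_t(s',a') - Q_t(s,a) \right],
\qquad
\sum_{t} \alpha_t(s,a) = \infty, \quad \sum_{t} \alpha_t(s,a)^2 < \infty .
$$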

Figures (5)

Figure 1: Breakdown of average reward per dimension over all state/action pairs for Win-Lie and Win-Challenge agents
Figure 2: Comparison of distribution of actions that would have been taken with lie dimension unmasked (left) and lie dimension masked with a weight of 0 (right) across 5000 games.
Figure 4: Across 5000 games, win percent and percentage of actions that were lies while varying the lie dimension in the strategy mask.
Figure 5: DQN function-approximator architecture used for training Coup agents.
Figure 6: StarLite league play structure.

Theorems & Definitions (4)

Theorem 2.1
Lemma C.1
proof
proof
