Strategy Masking: A Method for Guardrails in Value-based Reinforcement Learning Agents
Jonathan Keane, Sam Keyser, Jeremy Kedziora
TL;DR
This paper addresses the problem that reward functions can incentivize undesirable or unethical AI behaviors. It introduces strategy masking, a method that decomposes the reward into $K$ factors, $\vec{r}(s',a,s) = \langle r_k(\cdot)\rangle_{k=1}^K$, and pairs it with a vector-valued action-value function $\vec{Q}_{\pi}(s,a) = \langle Q^{(k)}_{\pi}(s,a)\rangle_{k=1}^K$. A masking vector $\vec{m}$ then selects which factors influence decisions through the scalar $\vec{Q}\cdot\vec{m}$, enabling control of behavior both during and after training. The approach extends to function approximation, including a masked DQN target that uses $a^*_{\vec{m}}(s') = \arg\max_a \{\vec{Q}(s',a)\cdot\vec{m}\}$, and comes with convergence guarantees for masked $Q$-learning. The authors apply the framework to Coup, a multi-agent, partially observable game, to study lying and its suppression. Results show that the chosen masks steer behaviors such as lying as intended, and that adjusting the mask after training can reduce undesirable actions without substantially harming win rate. Overall, strategy masking offers a general, scalable mechanism for imposing guardrails on reward-based agents, with potential applicability to large-scale AI systems and safety-critical domains.
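To make the masking mechanism concrete, the following is a minimal tabular sketch in Python and not the authors' implementation. It assumes each component $Q^{(k)}$ is updated from its own reward factor $r_k$ and bootstraps from the mask-selected action $a^*_{\vec{m}}(s')$. The function names, the nested-list layout `Q[s][a][k]`, the step size `alpha`, the discount `gamma`, and the two-factor "win" and "lie" example are all hypothetical choices made for illustration.

```python
def masked_score(q_components, mask):
    """Dot product Q(s, a) . m for one action's K-component value vector."""
    return sum(q * m for q, m in zip(q_components, mask))


def masked_greedy_action(q_state, mask):
    """Return argmax_a { Q(s, a) . m }, where q_state[a] is a length-K list.

    Ties are broken in favor of the lowest action index.
    """
    scores = [masked_score(q_a, mask) for q_a in q_state]
    return max(range(len(scores)), key=scores.__getitem__)


def masked_q_update(Q, s, a, r_vec, s_next, mask,
                    alpha=0.1, gamma=0.99, done=False):
    """One tabular update of every component Q[s][a][k], in place.

    Assumption (illustrative): all K components bootstrap from the same
    next action, the one that is greedy under the masked score.
    """
    a_star = masked_greedy_action(Q[s_next], mask)
    for k, r_k in enumerate(r_vec):
        bootstrap = 0.0 if done else Q[s_next][a_star][k]
        Q[s][a][k] += alpha * (r_k + gamma * bootstrap - Q[s][a][k])
    return Q


# Hypothetical two-factor example: component 0 tracks winning,
# component 1 tracks reward attributed to lying.
q_state = [
    [1.0, 0.0],  # action 0: truthful play
    [0.8, 0.5],  # action 1: bluff
]
action_with_lying = masked_greedy_action(q_state, [1.0, 1.0])
action_lying_masked = masked_greedy_action(q_state, [1.0, 0.0])
```

Under these made-up values, including the lying component favors the bluff, while zeroing that component in the mask switches the greedy choice to truthful play. Because the per-factor values are stored separately, the mask can be changed after training without relearning `Q`.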
Abstract
The use of reward functions to structure AI learning and decision making is core to the current reinforcement learning paradigm; however, without careful design of reward functions, agents can learn to solve problems in ways that may be considered "undesirable" or "unethical." Without a thorough understanding of the incentives a reward function creates, it can be difficult to impose principled yet general control mechanisms over an agent's behavior. In this paper, we study methods for constructing guardrails for AI agents that use reward functions to learn decision making. We introduce a novel approach, which we call strategy masking, to explicitly learn and then suppress undesirable AI agent behavior. We apply our method to study lying in AI agents and show that it can modify agent behavior by suppressing lying post-training without compromising the agent's ability to perform effectively.
