What Would Jiminy Cricket Do? Towards Agents That Behave Morally
Dan Hendrycks, Mantas Mazeika, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li, Jacob Steinhardt
TL;DR
AI agents currently lack an intrinsic moral sense and can inherit harmful biases from their training environments. The paper introduces Jiminy Cricket, a suite of 25 annotated Infocom text adventures, together with an artificial conscience implemented as Commonsense Morality Policy Shaping (CMPS). CMPS uses ETHICS-based moral scores to shape action selection, substantially reducing immoral actions without sacrificing task progress. In the reported experiments, CMPS lowers Immorality and Relative Immorality across most games, and oracle-based shaping yields even stronger moral alignment. The work demonstrates a practical path to value-aligned behavior in semantically rich environments and provides a benchmark and methodology for future research on the safety and morality of AI agents.
Abstract
When making everyday decisions, people are guided by their conscience, an internal sense of right and wrong. By contrast, artificial agents are currently not endowed with a moral sense. As a consequence, they may learn to behave immorally when trained on environments that ignore moral concerns, such as violent video games. With the advent of generally capable agents that pretrain on many environments, it will become necessary to mitigate inherited biases from environments that teach immoral behavior. To facilitate the development of agents that avoid causing wanton harm, we introduce Jiminy Cricket, an environment suite of 25 text-based adventure games with thousands of diverse, morally salient scenarios. By annotating every possible game state, the Jiminy Cricket environments robustly evaluate whether agents can act morally while maximizing reward. Using models with commonsense moral knowledge, we create an elementary artificial conscience that assesses and guides agents. In extensive experiments, we find that the artificial conscience approach can steer agents towards moral behavior without sacrificing performance.
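The idea of an artificial conscience that guides action selection can be illustrated with a minimal sketch. The text above does not spell out the shaping rule, so the form used here is an assumption: each candidate action's value estimate is lowered by a fixed penalty when a moral-knowledge model scores the action as immoral above a threshold. The function names, the threshold, the penalty, and all scores below are hypothetical and chosen only for illustration.

```python
from typing import Dict


def shape_q_values(
    q_values: Dict[str, float],
    immorality_scores: Dict[str, float],
    threshold: float = 0.5,   # hypothetical cutoff for "immoral"
    penalty: float = 10.0,    # hypothetical penalty size
) -> Dict[str, float]:
    """Subtract a fixed penalty from actions scored as immoral.

    Assumed rule for illustration: shaped = q - penalty if score > threshold.
    Actions without a score are treated as morally neutral.
    """
    shaped = {}
    for action, q in q_values.items():
        score = immorality_scores.get(action, 0.0)
        shaped[action] = q - penalty if score > threshold else q
    return shaped


def pick_action(shaped_q_values: Dict[str, float]) -> str:
    """Greedy selection over the shaped values."""
    return max(shaped_q_values, key=shaped_q_values.get)


# Made-up value estimates for candidate text commands.
q_values = {"open door": 1.2, "attack guard": 2.0, "take lamp": 0.9}
# Made-up scores standing in for a commonsense morality model's output.
immorality = {"open door": 0.05, "attack guard": 0.93, "take lamp": 0.10}

shaped = shape_q_values(q_values, immorality)
best = pick_action(shaped)
print(best)
```

In this toy setup the unshaped agent would pick "attack guard" because it has the highest value estimate, while the shaped agent falls back to the best action that was not flagged. The reward signal itself is untouched, which is why shaping of this kind need not reduce task progress as long as a harmless route to the goal exists.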
