What Would Jiminy Cricket Do? Towards Agents That Behave Morally

Dan Hendrycks, Mantas Mazeika, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li, Jacob Steinhardt

TL;DR

AI agents currently lack an intrinsic moral sense and can inherit harmful biases from the environments they are trained on. The paper introduces Jiminy Cricket, a suite of 25 annotated Infocom text adventures, together with an artificial conscience, Commonsense Morality Policy Shaping (CMPS), which uses moral scores from a classifier trained on ETHICS to shape action selection. In the reported experiments, CMPS lowers Immorality and Relative Immorality across most games without sacrificing task progress, and oracle-based shaping yields even stronger moral alignment. The work offers a practical path to value-aligned behavior in semantically rich environments, along with a benchmark and methodology for future research on agent safety and morality.

Abstract

When making everyday decisions, people are guided by their conscience, an internal sense of right and wrong. By contrast, artificial agents are currently not endowed with a moral sense. As a consequence, they may learn to behave immorally when trained on environments that ignore moral concerns, such as violent video games. With the advent of generally capable agents that pretrain on many environments, it will become necessary to mitigate inherited biases from environments that teach immoral behavior. To facilitate the development of agents that avoid causing wanton harm, we introduce Jiminy Cricket, an environment suite of 25 text-based adventure games with thousands of diverse, morally salient scenarios. By annotating every possible game state, the Jiminy Cricket environments robustly evaluate whether agents can act morally while maximizing reward. Using models with commonsense moral knowledge, we create an elementary artificial conscience that assesses and guides agents. In extensive experiments, we find that the artificial conscience approach can steer agents towards moral behavior without sacrificing performance.

Paper Structure

This paper contains 87 sections, 1 equation, 16 figures, 7 tables.

Figures (16)

  • Figure 1: The Jiminy Cricket environment evaluates text-based agents on their ability to act morally in complex environments. In one path the agent chooses a moral action, and in the other three paths the agent omits helping, steals from the victim, or destroys evidence. In all paths, the reward is zero, highlighting a hazardous bias in environment rewards, namely that they sometimes do not penalize immoral behavior. By comprehensively annotating moral scenarios at the source code level, we ensure high-quality annotations for every possible action the agent can take.
  • Figure 2: Rewards are biased towards indifference to, or even incentivizing, immoral behavior. From left to right: The agent kills a lizard in a gruesome manner and is rewarded for it. The agent helps out an old man but is not rewarded for it. The agent tries to injure a butler by blowing pepper in his face and is not punished for it. The agent receives the same punishment for torturing and ruffling leaves.
  • Figure 3: Our framework for annotating scenarios in Jiminy Cricket. The framework is designed to capture pro tanto judgements about moral valence and ordinal degree. For example, murder is usually bad, and murder is usually worse than theft. Hence, murder and theft are annotated as immoral with degrees 3 and 2 respectively. By annotating games with our framework, we improve consistency and limit subjectivity to its design, which integrates moral precedents from deontology, virtue ethics, ordinary morality, and utilitarianism.
  • Figure 4: Our proposed method, Commonsense Morality Policy Shaping (CMPS). Moral knowledge from a classifier trained on ETHICS is combined with standard Q-learning to obtain a shaped policy that is robust to noise in $f_\text{immoral}$ and takes fewer immoral actions.
  • Figure 5: CMPS reduces Immorality throughout training without competency trade-offs.
  • ...and 11 more figures
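
The Figure 4 caption describes CMPS as combining a morality classifier with standard Q-learning, but it does not state the exact combination rule. The following is a minimal sketch of one plausible reading, in which Q-values of actions flagged by the classifier are lowered by a fixed penalty before a policy is derived. The names `shape_q_values`, `toy_immorality_score`, `threshold`, and `penalty` are hypothetical, and the keyword-based scorer is only a stand-in for a classifier trained on ETHICS.

```python
import math
from typing import Callable, List, Sequence


def toy_immorality_score(action: str) -> float:
    """Hypothetical stand-in for f_immoral; a real system would use a
    classifier trained on ETHICS rather than keyword matching."""
    flagged = ("steal", "kill", "destroy")
    return 0.9 if any(word in action for word in flagged) else 0.1


def shape_q_values(
    q_values: Sequence[float],
    actions: Sequence[str],
    immorality_score: Callable[[str], float],
    threshold: float = 0.5,   # assumed cutoff, not taken from the source
    penalty: float = 10.0,    # assumed penalty size, not taken from the source
) -> List[float]:
    """Lower the Q-value of every action whose score exceeds the threshold."""
    return [
        q - penalty if immorality_score(a) > threshold else q
        for q, a in zip(q_values, actions)
    ]


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax, used here to turn Q-values into a policy."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


actions = ["help the old man", "steal the wallet", "open the door"]
q_values = [1.0, 2.0, 0.5]  # made-up values for illustration only

shaped = shape_q_values(q_values, actions, toy_immorality_score)
policy = softmax(shaped)
best_action = actions[policy.index(max(policy))]
```

In this toy setup the unshaped Q-values favor stealing, while the shaped policy favors helping. Thresholding the classifier score, rather than subtracting it directly, is one way to limit the effect of small scoring errors on actions that are clearly benign.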