MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents

Simon Rosen, Siddarth Singh, Ebenezer Gelo, Helen Sarah Robertson, Ibrahim Suder, Victoria Williams, Benjamin Rosman, Geraud Nangue Tasse, Steven James

TL;DR

MoralityGym introduces Morality Chains to formalize hierarchically prioritized moral norms and a Morality Metric to quantify alignment of sequential decision-making agents with these norms. The benchmark comprises 98 trolley-style environments and provides a decoupled moral evaluation pipeline, including step-wise costs and deontic judgments, enabling principled assessment beyond task performance. Empirical results show Safe RL baselines struggle under hierarchical moral constraints, with reward shaping and constrained optimization offering partial improvements; the framework facilitates principled development of transparent, ethically aligned agents. The work bridges moral psychology, normative ethics, and RL, offering a practical testbed for agents that reason about normative priorities in real-time decision-making.

Abstract

Evaluating moral alignment in agents navigating conflicting, hierarchically structured human norms is a critical challenge at the intersection of AI safety, moral philosophy, and cognitive science. We introduce Morality Chains, a novel formalism for representing moral norms as ordered deontic constraints, and MoralityGym, a benchmark of 98 ethical-dilemma problems presented as trolley-dilemma-style Gymnasium environments. By decoupling task-solving from moral evaluation and introducing a novel Morality Metric, MoralityGym allows the integration of insights from psychology and philosophy into the evaluation of norm-sensitive reasoning. Baseline results with Safe RL methods reveal key limitations, underscoring the need for more principled approaches to ethical decision-making. This work provides a foundation for developing AI systems that behave more reliably, transparently, and ethically in complex real-world contexts.

Paper Structure (69 sections, 7 equations, 17 figures, 7 tables)

Figures (17)

  • Figure 1: The PushOrSwitch scenario. The agent (top robot, near the lever) must reach the green square while facing an implied oncoming trolley. It can: (1) "Do Nothing": allow the trolley to continue on the main track, killing five humans (labelled "*5"); (2) "Flip Switch": divert the trolley to a side track, killing two humans (labelled "*2"); or (3) "Push Person": push one bystander (labelled "*1") onto the main track, killing the bystander but saving the five. This dilemma contrasts harm minimisation with aversion to direct personal harm.
  • Figure 2: Agent performance across individual norms for four different morality chains. Each bar represents the average morality function score for a given norm, evaluated across all compatible environments. Error bars indicate the standard deviation over multiple seeds. Abbreviations: min humans harmed (MHH), min animals harmed (MAH), min robots harmed (MRH), avoid agent harm (AAgH), avoid personal human harm (APHH), avoid personal animal harm (APAH), and avoid personal robot harm (APRH).
  • Figure 3: SwitchStandard
  • Figure 4: PushStandard
  • Figure 5: PushOrSwitch
  • ...and 12 more figures

Theorems & Definitions (4)

  • Definition 1: Norm
  • Definition 2: Morality Function
  • Definition 3: Morality Chain
  • Definition 4: Morality Metric
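
The definitions themselves are not reproduced in this listing. As a rough illustration of what "hierarchically prioritized" norms can mean, the sketch below ranks the three PushOrSwitch outcomes from Figure 1 under two different norm orderings. It assumes a strict lexicographic comparison of per-norm costs, which may differ from the paper's actual Morality Chain and Morality Metric definitions. The `Outcome` class, the `NORM_COSTS` table, and `preferred_action` are hypothetical names introduced only for this example; the abbreviations MHH and APHH are taken from the Figure 2 caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical summary of one PushOrSwitch choice (Figure 1)."""
    action: str
    humans_harmed: int    # deaths as described in the Figure 1 caption
    personal_harm: bool   # True if the agent directly pushes a person


OUTCOMES = [
    Outcome("do_nothing", 5, False),
    Outcome("flip_switch", 2, False),
    Outcome("push_person", 1, True),
]

# Illustrative per-norm costs (assumptions, not the paper's morality functions).
NORM_COSTS = {
    "MHH": lambda o: o.humans_harmed,        # min humans harmed
    "APHH": lambda o: int(o.personal_harm),  # avoid personal human harm
}


def preferred_action(chain, outcomes=OUTCOMES):
    """Return the action whose cost tuple is lexicographically smallest.

    `chain` lists norm names from highest to lowest priority, so an
    earlier norm always dominates any later one (an assumed reading of
    "hierarchically prioritized").
    """
    best = min(outcomes, key=lambda o: tuple(NORM_COSTS[n](o) for n in chain))
    return best.action


if __name__ == "__main__":
    print(preferred_action(["MHH", "APHH"]))   # harm minimisation first
    print(preferred_action(["APHH", "MHH"]))   # personal-harm aversion first
```

Under these assumptions, putting MHH first selects pushing the bystander (one death), while putting APHH first selects flipping the switch (two deaths, no direct personal harm), which mirrors the contrast described in the Figure 1 caption.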