MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents
Simon Rosen, Siddarth Singh, Ebenezer Gelo, Helen Sarah Robertson, Ibrahim Suder, Victoria Williams, Benjamin Rosman, Geraud Nangue Tasse, Steven James
TL;DR
MoralityGym introduces Morality Chains, which formalize hierarchically prioritized moral norms, and a Morality Metric, which quantifies how well sequential decision-making agents align with those norms. The benchmark comprises 98 trolley-style environments and provides a decoupled moral evaluation pipeline, including step-wise costs and deontic judgments, so that agents can be assessed on more than task performance. Empirically, Safe RL baselines struggle under hierarchical moral constraints, with reward shaping and constrained optimization offering partial improvements. The work bridges moral psychology, normative ethics, and RL, and offers a practical testbed for developing transparent, ethically aligned agents that reason about normative priorities in real-time decision-making.
Abstract
Evaluating moral alignment in agents navigating conflicting, hierarchically structured human norms is a critical challenge at the intersection of AI safety, moral philosophy, and cognitive science. We introduce Morality Chains, a novel formalism for representing moral norms as ordered deontic constraints, and MoralityGym, a benchmark of 98 ethical-dilemma problems presented as trolley-dilemma-style Gymnasium environments. By decoupling task-solving from moral evaluation and introducing a novel Morality Metric, MoralityGym allows the integration of insights from psychology and philosophy into the evaluation of norm-sensitive reasoning. Baseline results with Safe RL methods reveal key limitations, underscoring the need for more principled approaches to ethical decision-making. This work provides a foundation for developing AI systems that behave more reliably, transparently, and ethically in complex real-world contexts.
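To make the idea of hierarchically ordered norms concrete, the sketch below compares two trajectories by their accumulated per-norm costs in strict priority order (a lexicographic comparison). It is an illustration only, not the paper's Morality Chain formalism or Morality Metric: the norm names, the cost values, and the helpers `accumulate_costs` and `more_aligned` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical norm ordering, highest priority first (not taken from the paper).
NORM_PRIORITY = ["do_not_kill", "do_not_injure", "do_not_damage_property"]

StepCosts = Dict[str, float]  # cost incurred per norm at a single time step


def accumulate_costs(step_costs: List[StepCosts]) -> Tuple[float, ...]:
    """Sum step-wise costs into one total per norm, ordered by priority."""
    totals = {norm: 0.0 for norm in NORM_PRIORITY}
    for costs in step_costs:
        for norm, cost in costs.items():
            totals[norm] += cost
    return tuple(totals[norm] for norm in NORM_PRIORITY)


def more_aligned(a: List[StepCosts], b: List[StepCosts]) -> bool:
    """True if trajectory `a` violates the hierarchy strictly less than `b`.

    Python compares tuples lexicographically, so a higher-priority norm
    always dominates any amount of lower-priority cost.
    """
    return accumulate_costs(a) < accumulate_costs(b)


# Illustrative trajectories with made-up costs.
divert = [{"do_not_injure": 1.0}, {"do_not_damage_property": 2.0}]
stay = [{"do_not_kill": 1.0}]
```

Under these assumed costs, `more_aligned(divert, stay)` holds because the `divert` trajectory incurs no cost on the top-priority norm, regardless of its lower-priority costs. Because the costs are logged per step and scored separately, an evaluation of this kind stays independent of the task reward, which is the decoupling the abstract describes.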
