Procedural Dilemma Generation for Evaluating Moral Reasoning in Humans and Language Models

Jan-Philipp Fränken; Kanishk Gandhi; Tori Qiu; Ayesha Khawaja; Noah D. Goodman; Tobias Gerstenberg

TL;DR

This work introduces OffTheRails, a procedural framework that uses causal-graph templates to generate moral-dilemma prompts at scale for evaluating moral reasoning in humans and language models. A two-stage generation process yields 50 scenarios with eight conditions each (400 items), and the study compares 80 human judgments with GPT-4 and Claude-2 responses. Harm that is a necessary means (rather than a side effect) and harm that is evitable (rather than inevitable) both lower permissibility ratings and raise intention ratings, whereas commission versus omission shows little effect. Model ratings correlate with human judgments, though with weaker signals. The framework offers a scalable, structured way to probe moral reasoning in AI systems; the paper discusses its current limitations and points to ways of improving scenario strength and coverage.

Abstract

As AI systems like language models are increasingly integrated into decision-making processes affecting people's lives, it's critical to ensure that these systems have sound moral reasoning. To test whether they do, we need to develop systematic evaluations. We provide a framework that uses a language model to translate causal graphs that capture key aspects of moral dilemmas into prompt templates. With this framework, we procedurally generated a large and diverse set of moral dilemmas -- the OffTheRails benchmark -- consisting of 50 scenarios and 400 unique test items. We collected moral permissibility and intention judgments from human participants for a subset of our items and compared these judgments to those from two language models (GPT-4 and Claude-2) across eight conditions. We find that moral dilemmas in which the harm is a necessary means (as compared to a side effect) resulted in lower permissibility and higher intention ratings for both participants and language models. The same pattern was observed for evitable versus inevitable harmful outcomes. However, there was no clear effect of whether the harm resulted from an agent's action versus from having omitted to act. We discuss limitations of our prompt generation pipeline and opportunities for improving scenarios to increase the strength of experimental effects.
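The item counts above follow from crossing the three binary variables (causal structure, evitability, action): 2 × 2 × 2 = 8 conditions per scenario, and 50 scenarios × 8 conditions = 400 items. The sketch below illustrates that arithmetic and the general idea of stitching condition-specific sentences into a test item. It is a minimal illustration, not the authors' pipeline: the factor labels, the `stitch` helper, and the scenario text are all hypothetical and were written for this example rather than taken from the benchmark.

```python
from itertools import product

# The three binary variables named in the paper. The level labels are
# illustrative assumptions, not identifiers from the authors' code.
FACTORS = {
    "causal_structure": ("means", "side_effect"),
    "evitability": ("evitable", "inevitable"),
    "action": ("commission", "omission"),
}


def enumerate_conditions():
    """Return every combination of factor levels as a dict (2 x 2 x 2 = 8)."""
    names = list(FACTORS)
    return [dict(zip(names, levels)) for levels in product(*FACTORS.values())]


def stitch(scenario, condition):
    """Join the shared context with one sentence per factor level.

    Hypothetical helper: the real pipeline fills template variables with a
    language model; here the sentences are supplied by hand.
    """
    parts = [scenario["context"]]
    for factor, level in condition.items():
        parts.append(scenario[factor][level])
    return " ".join(parts)


# A made-up scenario used only to exercise the code (not a benchmark item).
scenario = {
    "context": "Sam manages a water reservoir.",
    "causal_structure": {
        "means": "Flooding a field is required to protect the town.",
        "side_effect": "Protecting the town also floods a field.",
    },
    "evitability": {
        "evitable": "The field would stay dry if nothing were done.",
        "inevitable": "The field would flood soon regardless.",
    },
    "action": {
        "commission": "Sam opens the gate.",
        "omission": "Sam does not stop the gate from opening.",
    },
}

conditions = enumerate_conditions()
items = [stitch(scenario, condition) for condition in conditions]

NUM_SCENARIOS = 50  # scenario count reported in the abstract
print(len(conditions), NUM_SCENARIOS * len(conditions))
```

Keeping the factors in a single dictionary means the condition set is derived rather than hand-listed, so adding or removing a variable changes the item count consistently.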

Paper Structure (15 sections, 6 figures, 2 tables)

This paper contains 15 sections, 6 figures, 2 tables.

Introduction
Related Work
Evaluating Moral Reasoning in Humans
Evaluating Moral Reasoning with Language Models
Model-Written Evaluations
Off The Rails
Experiments
Experiment 1
Experiment 2
Human Experiment
Model Evaluations
Discussion
Limitations of the prompt generation pipeline
Why did our experimental manipulations have relatively weak effects?
Conclusion

Figures (6)

Figure 1: We represent moral reasoning dilemmas with causal graphs, focusing on three key variables. Causal Structure: determining if the harm is a means to achieve a goal or just a side effect. Evitability: assessing whether the harm is inevitable or dependent on the agent's actions. Action: analyzing if the agent actively causes harm (commission) or fails to prevent it (omission). We then use a language model to fill out template variables, which we "stitch" together to generate test items, such as the example shown in the right panel.
Figure 2: Prompt template (simplified) for generating completions with GPT-4. See our repository at https://github.com/cicl-stanford/moral-evals for further details.
Figure 3: Experiment 1: Evaluating harm versus good with a 1–7 scale (1 $=$ Very Bad, 7 $=$ Very Good). Example item: [Background Sentence: Noor has the opportunity to renovate a park in the city. Target Sentence: This renovation requires temporarily depriving the community of a beloved recreational space, causing inconvenience and disappointment among residents.] Error bars show 95% CI across participants.
Figure 4: Experiment 2. [a] For each test item, we collect human and model ratings to assess the moral permissibility of an agent's action, as well as whether the agent intended for the negative outcome to occur. [b] Correlation between intention (x-axis) and permissibility (y-axis) for each item rated by participants. Shaded area corresponds to 95% CI. [c] and [d] show correlations for GPT-4 and Claude-2 based on averaging over multiple completions with temperature $T = 1.0$ for each model ($n = 20$ for GPT-4 and $n = 5$ for Claude-2).
Figure 5: Experiment 2. Average permissibility and intention ratings across conditions: [a] Human participants. [b] Results for GPT-4. [c] Results for Claude-2. Error bars show 95% confidence intervals across scenarios.
...and 1 more figure
