Deliberative Dynamics and Value Alignment in LLM Debates
Pratik S. Sachdeva, Tom van Nuenen
TL;DR
This study investigates how large language models negotiate moral judgments in multi-turn debates, using real-world dilemmas from Reddit's "Am I the Asshole" community. It compares synchronous and round-robin deliberation across three models (GPT-4.1, Claude 3.7 Sonnet, Gemini 2.0 Flash) and analyzes the elicited values with the Values in the Wild taxonomy, revealing distinct verdict revision patterns and value alignment behaviors. Key findings show model-specific inertia and conformity, with value convergence correlating with consensus. Round-robin formats amplify conformity and order effects, while system prompts can steer verdict flexibility but do not guarantee consensus. The results highlight how dialogue structure and model idiosyncrasies jointly shape sociotechnical alignment, offering a framework for evaluating and steering moral reasoning in deployed multi-agent LLM systems.
Abstract
As large language models (LLMs) are increasingly deployed in sensitive everyday contexts (offering personal advice, mental health support, and moral guidance), understanding the values they express when navigating complex moral reasoning is essential. Most evaluations study this sociotechnical alignment through single-turn prompts, but it is unclear whether these findings extend to multi-turn settings where values emerge through dialogue, revision, and consensus. We address this gap using LLM debate to examine deliberative dynamics and value alignment in multi-turn settings by prompting subsets of three models (GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.0 Flash) to collectively assign blame in 1,000 everyday dilemmas from Reddit's "Am I the Asshole" community. We use both synchronous (parallel responses) and round-robin (sequential responses) formats to test order effects and verdict revision. Our findings show striking behavioral differences. In the synchronous setting, GPT showed strong inertia (0.6-3.1% revision rates) while Claude and Gemini were far more flexible (28-41%). Value patterns also diverged: GPT emphasized personal autonomy and direct communication, while Claude and Gemini prioritized empathetic dialogue. Certain values proved especially effective at driving verdict changes. We further find that deliberation format had a strong impact on model behavior: GPT and Gemini stood out as highly conforming relative to Claude, with their verdict behavior strongly shaped by order effects. These results show how deliberation format and model-specific behaviors shape moral reasoning in multi-turn interactions, underscoring that sociotechnical alignment depends on how systems structure dialogue as much as on their outputs.
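The difference between the two deliberation formats, and what a revision rate measures, can be made concrete with a small sketch. This is not the authors' implementation: the `Agent` interface, the stub agents (`stubborn`, `conformist`), the function names, and the verdict labels "YTA" and "NTA" are all assumptions made for illustration, and real agents would be LLM calls rather than fixed rules.

```python
from typing import Callable, Dict, List

# Hypothetical agent interface: (dilemma text, verdicts visible so far) -> verdict label.
Agent = Callable[[str, Dict[str, str]], str]


def synchronous_round(agents: Dict[str, Agent], dilemma: str,
                      previous: Dict[str, str]) -> Dict[str, str]:
    """All agents answer in parallel; each sees only the previous round's verdicts."""
    return {name: agent(dilemma, dict(previous)) for name, agent in agents.items()}


def round_robin_round(agents: Dict[str, Agent], order: List[str], dilemma: str,
                      previous: Dict[str, str]) -> Dict[str, str]:
    """Agents answer one at a time; later speakers also see this round's earlier verdicts."""
    visible = dict(previous)
    for name in order:
        visible[name] = agents[name](dilemma, dict(visible))
    return visible


def revision_rate(initial: List[str], final: List[str]) -> float:
    """Fraction of dilemmas on which an agent's final verdict differs from its initial one."""
    if not initial:
        return 0.0
    changed = sum(1 for first, last in zip(initial, final) if first != last)
    return changed / len(initial)


def stubborn(label: str) -> Agent:
    """Illustrative stub: always returns the same verdict (maximal inertia)."""
    def agent(dilemma: str, visible: Dict[str, str]) -> str:
        return label
    return agent


def conformist(name: str, default: str) -> Agent:
    """Illustrative stub: adopts another agent's visible verdict, else its own default."""
    def agent(dilemma: str, visible: Dict[str, str]) -> str:
        others = [verdict for key, verdict in visible.items() if key != name]
        return others[-1] if others else default
    return agent
```

With one stubborn and one conformist stub, the round-robin outcome depends on who speaks first, which is the kind of order effect the round-robin format is designed to expose; in the synchronous format no agent can react to a verdict issued in the same round.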
