Deliberative Dynamics and Value Alignment in LLM Debates

Pratik S. Sachdeva, Tom van Nuenen

TL;DR

This study investigates how large language models negotiate moral judgments in multi-turn debates using real-world dilemmas from Reddit's "Am I the Asshole" community. By comparing synchronous and round-robin deliberation across three models (GPT-4.1, Claude 3.7 Sonnet, Gemini 2.0 Flash) and analyzing the elicited values via the Values in the Wild taxonomy, the work reveals distinct verdict revision patterns and value alignment behaviors. Key findings show model-specific inertia and conformity, with value convergence correlating with consensus; round-robin formats amplify conformity and order effects, while system prompts can steer verdict flexibility but do not guarantee consensus. The results highlight how dialogue structure and model idiosyncrasies jointly shape sociotechnical alignment, offering a framework for evaluating and steering moral reasoning in deployed multi-agent LLM systems.

Abstract

As large language models (LLMs) are increasingly deployed in sensitive everyday contexts - offering personal advice, mental health support, and moral guidance - understanding their elicited values in navigating complex moral reasoning is essential. Most evaluations study this sociotechnical alignment through single-turn prompts, but it is unclear if these findings extend to multi-turn settings where values emerge through dialogue, revision, and consensus. We address this gap using LLM debate to examine deliberative dynamics and value alignment in multi-turn settings by prompting subsets of three models (GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.0 Flash) to collectively assign blame in 1,000 everyday dilemmas from Reddit's "Am I the Asshole" community. We use both synchronous (parallel responses) and round-robin (sequential responses) formats to test order effects and verdict revision. Our findings show striking behavioral differences. In the synchronous setting, GPT showed strong inertia (0.6-3.1% revision rates) while Claude and Gemini were far more flexible (28-41%). Value patterns also diverged: GPT emphasized personal autonomy and direct communication, while Claude and Gemini prioritized empathetic dialogue. Certain values proved especially effective at driving verdict changes. We further find that deliberation format had a strong impact on model behavior: GPT and Gemini stood out as highly conforming relative to Claude, with their verdict behavior strongly shaped by order effects. These results show how deliberation format and model-specific behaviors shape moral reasoning in multi-turn interactions, underscoring that sociotechnical alignment depends on how systems structure dialogue as much as on their outputs.
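
The two deliberation formats described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Model` callable type, the `max_rounds` default, the end-of-round consensus check in the round-robin loop, and the toy `stubborn` and `make_conformist` agents are all assumptions introduced here. Real responses would also carry an explanation alongside the verdict.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interface: a "model" maps (dilemma, transcript so far) to a verdict label.
Transcript = List[Tuple[str, str]]  # (speaker name, verdict)
Model = Callable[[str, Transcript], str]


def synchronous(dilemma: str, models: Dict[str, Model],
                max_rounds: int = 4) -> Tuple[Optional[str], int]:
    """All models answer in parallel; each sees only previous rounds."""
    transcript: Transcript = []
    for round_no in range(1, max_rounds + 1):
        snapshot = list(transcript)  # same view for every model in this round
        verdicts = {name: m(dilemma, snapshot) for name, m in models.items()}
        transcript.extend(verdicts.items())
        if len(set(verdicts.values())) == 1:  # consensus reached
            return next(iter(verdicts.values())), round_no
    return None, max_rounds  # no consensus within the round budget


def round_robin(dilemma: str, models: Dict[str, Model],
                max_rounds: int = 4) -> Tuple[Optional[str], int]:
    """Models answer in order; later speakers see earlier ones in the same round."""
    transcript: Transcript = []
    for round_no in range(1, max_rounds + 1):
        verdicts = {}
        for name, m in models.items():  # dict order defines speaking order
            verdicts[name] = m(dilemma, list(transcript))
            transcript.append((name, verdicts[name]))
        if len(set(verdicts.values())) == 1:  # assumed: checked at end of round
            return next(iter(verdicts.values())), round_no
    return None, max_rounds


def stubborn(dilemma: str, transcript: Transcript) -> str:
    """Toy agent with full inertia: never revises its verdict."""
    return "NTA"


def make_conformist(name: str, default: str) -> Model:
    """Toy agent that adopts the most recent verdict given by another speaker."""
    def model(dilemma: str, transcript: Transcript) -> str:
        for speaker, verdict in reversed(transcript):
            if speaker != name:
                return verdict
        return default
    return model
```

With these toy agents, speaking order changes how quickly consensus is reached in the round-robin loop, which is the kind of order effect the round-robin format is designed to expose.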

Paper Structure

This paper contains 23 sections, 2 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Deliberation formats. A dilemma (top left) can be discussed among models via two deliberation formats: synchronous or round-robin. Top: Synchronous deliberation, where models are simultaneously prompted to respond with their verdict and explanation. If the models agree, deliberation ends; otherwise, each model is provided the other's response and prompted to update its verdict. This process continues until consensus, or the maximum number of rounds is reached. Here, the two models achieve consensus on the "NAH" verdict. Bottom: Round-robin deliberation, where models are prompted in sequential order. Here, Model 2 views Model 1's response in Round 1 prior to providing its own verdict. In this example, the models proceed through 4 rounds of deliberation, unable to achieve consensus. Explanations truncated to conserve space.
  • Figure 2: Models vary in their tendencies to change verdicts during deliberation. The number of rounds and change-of-verdicts for synchronous (a-b) and round-robin (c-d) deliberation. a. Proportion of dilemmas ($y$-axis) that reached consensus in a given number of rounds ($x$-axis), or did not reach consensus (final $x$-tick) for each deliberation (colors: see legend). b. Change-of-verdict rate for each pairwise deliberation (color corresponds to legend in a). c-d. Same as a-b, but for round-robin deliberation. Hatched bars denote the same models, but reversed order (e.g., GPT vs. Claude, where GPT goes first). Error bars denote 95% bootstrapped confidence intervals.
  • Figure 3: Verdict distributions before and after deliberation. The proportion of dilemmas ($x$-axis) assigned a particular verdict ($y$-axis) for each of the three synchronous experiments. Verdict distributions after Round 1 (i.e., prior to deliberation) are indicated by colored points (see legend). Black arrows mark the proportion of dilemmas assigned a verdict after deliberation (i.e., achieving consensus). Red triangles denote the proportion of dilemmas not reaching consensus.
  • Figure 4: Values used and inherited during synchronous deliberation. Rows denote model pairs. Values are shown next to their corresponding bar. Up to 5 values reaching statistical significance are shown. a-c. The difference in value occurrences -- the fraction of messages in which a model uses a value -- between pairs of models. b. The fraction of deliberations where a specific value was inherited. Error bars denote bootstrapped 95% confidence intervals.
  • Figure 5: Values invoked by models align in deliberations with consensus. In all subplots, $y$-axis denotes the value similarity between the two models, averaged over dilemmas. a. Average value similarity for synchronous deliberation, with individual messages split by consensus and disagreement ($x$-ticks). b. Value similarities (for deliberations lasting more than one round) during Round 1 and the last round of deliberation, split between those reaching consensus, and those not (legend). Significance markers denote Mann-Whitney U tests ($***$: $p<10^{-3}$; $*$: $p<10^{-1}$; n.s.: no significance). Error bars denote bootstrapped 95% confidence intervals.
  • ...and 8 more figures
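
Several captions above report error bars as bootstrapped 95% confidence intervals. As a rough sketch of how such an interval could be computed for a change-of-verdict rate, the snippet below uses a plain percentile bootstrap. The definition of the rate (the fraction of deliberations in which a model's verdict changed), the function names, the resample count, and the toy data are assumptions made for illustration; the source does not specify the exact procedure.

```python
import random
from typing import List, Tuple


def change_rate(changed: List[bool]) -> float:
    """Fraction of deliberations in which the verdict changed (assumed definition)."""
    return sum(changed) / len(changed)


def bootstrap_ci(changed: List[bool], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the change-of-verdict rate."""
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(changed)
    # Resample deliberations with replacement and recompute the rate each time.
    stats = sorted(sum(rng.choices(changed, k=n)) / n for _ in range(n_resamples))
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Made-up toy data, not results from the paper: 30 changes in 100 deliberations.
toy = [True] * 30 + [False] * 70
rate = change_rate(toy)
low, high = bootstrap_ci(toy)
```

The percentile method is only one of several bootstrap variants; it is used here because it needs nothing beyond the standard library.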