Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models

Chenchen Yuan, Zheyu Zhang, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci

TL;DR

This work addresses divergent moral judgments across multiple LLMs on complex dilemmas by introducing a probabilistic, reliability-weighted aggregator that fuses continuous moral-acceptability scores $a_{m,j,i}$ into a collective probability $\gamma_{j,i}$. The core method uses a truncated-normal EM framework to learn model reliabilities and the consensus, while a targeted embedding-optimization procedure adjusts the moral-theory token embeddings of misaligned models to minimize JS divergence to that consensus. Validation on 42,501 AITA-derived moral dilemmas shows that the aggregator yields coherent collective opinions and that theory-token embedding refinements can substantially improve alignment for underperforming models, providing a data-driven path to safer, more consistent multi-LLM moral reasoning. The approach highlights how cross-model aggregation and localized representation edits can stabilize moral judgments in AI systems, with implications for alignment and governance in subjective domains.
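To make the idea of reliability-weighted aggregation concrete, here is a minimal Python sketch. It is not the paper's truncated-normal EM: it is a simplified stand-in that alternates between a weighted-mean consensus and inverse-variance reliabilities, ignoring truncation entirely. The function name `aggregate`, the `var_floor` parameter, and the toy scores are all assumptions made for illustration.

```python
def aggregate(scores, n_iter=50, var_floor=1e-2):
    """Simplified reliability-weighted aggregation (illustrative only).

    scores[m][i] is the acceptability score in [0, 1] that model m
    assigns to dilemma i. Returns (consensus, normalized_weights).
    """
    n_models = len(scores)
    n_items = len(scores[0])
    weights = [1.0] * n_models  # start from equal reliability
    consensus = [0.0] * n_items
    for _ in range(n_iter):
        total = sum(weights)
        # Consensus step: reliability-weighted mean score per dilemma.
        consensus = [
            sum(weights[m] * scores[m][i] for m in range(n_models)) / total
            for i in range(n_items)
        ]
        # Reliability step: inverse mean squared deviation from the consensus.
        # var_floor (an assumed regularizer) keeps one model from dominating.
        weights = [
            1.0 / (
                sum((scores[m][i] - consensus[i]) ** 2 for i in range(n_items))
                / n_items
                + var_floor
            )
            for m in range(n_models)
        ]
    total = sum(weights)
    return consensus, [w / total for w in weights]


# Toy usage: two models that roughly agree and one that disagrees.
scores = [
    [0.90, 0.10, 0.80, 0.20],
    [0.85, 0.15, 0.75, 0.25],
    [0.20, 0.90, 0.30, 0.70],
]
consensus, weights = aggregate(scores)
```

In this toy setting the dissenting third model ends up with the smallest weight, so the consensus tracks the two models that agree. A faithful implementation would replace both steps with the truncated-normal likelihood described in the paper.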

Abstract

Large Language Models (LLMs) have shown impressive moral reasoning abilities. Yet they often diverge when confronted with complex, multi-factor moral dilemmas. To address these discrepancies, we propose a framework that synthesizes multiple LLMs' moral judgments into a collectively formulated moral judgment, realigning models that deviate significantly from this consensus. Our aggregation mechanism fuses continuous moral acceptability scores (beyond binary labels) into a collective probability, weighting contributions by model reliability. For misaligned models, a targeted embedding-optimization procedure fine-tunes token embeddings for moral philosophical theories, minimizing JS divergence to the consensus while preserving semantic integrity. Experiments on a large-scale social moral dilemma dataset show our approach builds robust consensus and improves individual model fidelity. These findings highlight the value of data-driven moral alignment across multiple models and its potential for safer, more consistent AI systems.
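The alignment objective mentioned above, JS divergence to the consensus, can be illustrated with a short Python sketch. It assumes that a model's acceptability score and the collective probability are each treated as the parameter of a Bernoulli distribution; that reading, and the helper names `kl_bernoulli` and `js_bernoulli`, are assumptions for illustration rather than details taken from the paper.

```python
import math


def kl_bernoulli(p, q, eps=1e-12):
    """KL(Bernoulli(p) || Bernoulli(q)) in nats, clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def js_bernoulli(p, q):
    """Jensen-Shannon divergence between Bernoulli(p) and Bernoulli(q)."""
    m = 0.5 * (p + q)  # mixture distribution
    return 0.5 * kl_bernoulli(p, m) + 0.5 * kl_bernoulli(q, m)


# Hypothetical values: a model's score versus the collective probability.
model_score = 0.35
collective_prob = 0.80
divergence = js_bernoulli(model_score, collective_prob)
```

JS divergence is symmetric and, with natural logarithms, bounded above by ln 2, which makes it a stable target for an optimizer: it is zero exactly when the model's distribution matches the consensus.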

Paper Structure

This paper contains 41 sections, 9 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: An example of LLMs assessing a moral dilemma from deontological and utilitarian perspectives.
  • Figure 2: Framework for Collective Moral Reasoning. Multiple LLMs assess moral dilemmas based on different moral philosophical theories (referred to as moral concepts in the figure for brevity). Their judgments are aggregated into a collective opinion using the Truncated-Normal EM algorithm, while misaligned models undergo targeted embedding optimization and re-evaluation to improve consistency.
  • Figure 3: Comparison of Four Llama Variants with Other LLMs. LLMs A–E correspond to a specific version of Llama, GPT-3.5, Claude, Moonshot, and GPT-4o mini, whereas concepts $\text{A}^\prime$–$\text{E}^\prime$ represent the moral theories of deontology, utilitarianism, commonsense, justice, and virtue. + denotes the LLM holding the highest F1 score for each moral theory, while × marks the lowest. The F1 score is computed using the same metric described in Table \ref{tab:main_results_F1}.
  • Figure 4: PCA+t-SNE Projection of Deontology-related Token Embeddings. The term “concept” represents moral philosophical theory in this figure. $^*[\text{concept}]_i$ represents the moral-theory token trained from the $i$-th original token.
  • Figure 5: Impact of Random01 on Mean-based (Left) and Our (Right) Aggregation Strategy. This figure shows how Random01 impacts the basic LLMs' F1 scores per theory. Each box represents a theory, with top, middle, and bottom lines showing the highest, mean, and lowest values of F1 score differences among LLMs.
  • ...and 7 more figures