Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models
Chenchen Yuan, Zheyu Zhang, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci
TL;DR
This work addresses the divergent moral judgments that multiple LLMs produce on complex dilemmas. It introduces a probabilistic, reliability-weighted aggregator that fuses continuous moral-acceptability scores $a_{m,j,i}$ into a collective probability $\gamma_{j,i}$. The core method uses a truncated-normal EM framework to learn model reliabilities and the consensus, while a targeted embedding-optimization procedure adjusts token embeddings for misaligned moral theories to minimize Jensen-Shannon (JS) divergence to the consensus. Validation on 42,501 AITA-derived moral dilemmas shows that the aggregator yields coherent collective opinions and that theory-token embedding refinements can substantially improve alignment for underperforming models, providing a data-driven path to safer, more consistent multi-LLM moral reasoning. The approach highlights how cross-model aggregation and localized representation edits can stabilize moral judgments in AI systems, with implications for alignment and governance in subjective domains.
Abstract
Large Language Models (LLMs) have shown impressive moral reasoning abilities. Yet they often diverge when confronted with complex, multi-factor moral dilemmas. To address these discrepancies, we propose a framework that synthesizes multiple LLMs' moral judgments into a collectively formulated moral judgment and realigns models that deviate significantly from this consensus. Our aggregation mechanism fuses continuous moral acceptability scores (beyond binary labels) into a collective probability, weighting contributions by model reliability. For misaligned models, a targeted embedding-optimization procedure fine-tunes token embeddings for moral philosophical theories, minimizing Jensen-Shannon (JS) divergence to the consensus while preserving semantic integrity. Experiments on a large-scale social moral dilemma dataset show our approach builds robust consensus and improves individual model fidelity. These findings highlight the value of data-driven moral alignment across multiple models and its potential for safer, more consistent AI systems.
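To make the two quantities in the abstract concrete, the following is a minimal illustrative sketch, not the paper's method. It replaces the truncated-normal EM aggregator with a plain reliability-weighted mean, and it assumes each acceptability score can be read as the parameter of a Bernoulli distribution when computing the Jensen-Shannon divergence to the consensus. The function names and the fixed reliability weights are hypothetical; in the paper the reliabilities are learned rather than supplied.

```python
import math


def weighted_consensus(scores, reliabilities):
    """Reliability-weighted mean of acceptability scores in [0, 1].

    Simplified stand-in for the learned aggregator: the weights are
    given here instead of being estimated by EM.
    """
    total = sum(reliabilities)
    if total <= 0:
        raise ValueError("reliabilities must sum to a positive value")
    return sum(s * w for s, w in zip(scores, reliabilities)) / total


def _kl_bernoulli(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) in nats, with 0 * log 0 taken as 0."""
    kl = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            kl += a * math.log(a / b)
    return kl


def js_divergence_bernoulli(p, q):
    """Jensen-Shannon divergence between Bernoulli(p) and Bernoulli(q).

    Assumption: a score is treated as a Bernoulli parameter. The mixture
    m is strictly positive wherever p or q is, so no division by zero.
    """
    m = 0.5 * (p + q)
    return 0.5 * _kl_bernoulli(p, m) + 0.5 * _kl_bernoulli(q, m)


if __name__ == "__main__":
    # Hypothetical scores from three models for one dilemma.
    scores = [0.2, 0.8, 0.7]
    reliabilities = [1.0, 3.0, 2.0]
    consensus = weighted_consensus(scores, reliabilities)
    for s in scores:
        print(f"score={s:.2f}  JS to consensus={js_divergence_bernoulli(s, consensus):.4f}")
```

Under this reading, a model whose score sits far from the consensus gets a larger JS value, which is the kind of gap the embedding-optimization step is described as reducing. The divergence is symmetric and bounded by $\ln 2$ in nats.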
