Fine-tuning language models to find agreement among humans with diverse preferences

Michiel A. Bakker; Martin J. Chadwick; Hannah R. Sheahan; Michael Henry Tessler; Lucy Campbell-Gillingham; Jan Balaguer; Nat McAleese; Amelia Glaese; John Aslanides; Matthew M. Botvinick; Christopher Summerfield

TL;DR

The paper explores how large language models can be fine-tuned to help groups with diverse preferences reach consensus. It collects UK-based opinions on policy questions, generates multiple consensus candidates, and reranks them using reward models tied to isoelastic social welfare functions to maximize group agreement. Across extensive human evaluations, the welfare-optimized model outperforms baselines and human opinions, and demonstrates robustness to unseen topics, while analyses reveal the consensus depends on the specific input opinions. These results suggest LLMs can assist collective deliberation by producing group-aligned consensus statements, albeit with attention to biases and ethical safeguards.

Abstract

Recent work on large language models (LLMs) has used fine-tuning to align outputs with the preferences of a prototypical user. This work assumes that human preferences are static and homogeneous across individuals, so that aligning to a single "generic" user will confer more general alignment. Here, we embrace the heterogeneity of human preferences to consider a different challenge: how might a machine help people with diverse views find agreement? We fine-tune a 70 billion parameter LLM to generate statements that maximize the expected approval for a group of people with potentially diverse opinions. Human participants provide written opinions on thousands of questions touching on moral and political issues (e.g., "should we raise taxes on the rich?"), and rate the LLM's generated candidate consensus statements for agreement and quality. A reward model is then trained to predict individual preferences, enabling it to quantify and rank consensus statements in terms of their appeal to the overall group, defined according to different aggregation (social welfare) functions. The model produces consensus statements that are preferred by human users over those from prompted LLMs (>70%) and significantly outperforms a tight fine-tuned baseline that lacks the final ranking step. Further, our best model's consensus statements are preferred over the best human-generated opinions (>65%). We find that when we silently constructed consensus statements from only a subset of group members, those who were excluded were more likely to dissent, revealing the sensitivity of the consensus to individual contributions. These results highlight the potential to use LLMs to help groups of humans align their values with one another.
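The ranking step described in the abstract (predict each participant's agreement, aggregate with a social welfare function, keep the highest-scoring candidate) can be illustrated with a minimal sketch. It assumes a common power-mean form of the isoelastic welfare function, in which `alpha = 0` gives the utilitarian mean and `alpha = inf` gives the Rawlsian minimum; the paper's exact parametrisation is not shown in this excerpt. The names `isoelastic_welfare`, `rerank`, and `predict_agreement` are hypothetical, and `predict_agreement` stands in for the trained reward model.

```python
import math
from typing import Callable, Sequence


def isoelastic_welfare(utilities: Sequence[float], alpha: float) -> float:
    """Aggregate positive per-person utilities into a single welfare score.

    Assumed power-mean form (not taken from the paper):
    alpha = 0 is the mean, alpha = 1 the geometric mean, alpha = inf the minimum.
    """
    if not utilities or any(u <= 0 for u in utilities):
        raise ValueError("utilities must be non-empty and strictly positive")
    n = len(utilities)
    if math.isinf(alpha):
        return min(utilities)
    if alpha == 1:
        return math.exp(sum(math.log(u) for u in utilities) / n)
    p = 1 - alpha
    return (sum(u ** p for u in utilities) / n) ** (1 / p)


def rerank(
    candidates: Sequence[str],
    opinions: Sequence[str],
    predict_agreement: Callable[[str, str], float],
    alpha: float = 0.0,
) -> str:
    """Return the candidate whose predicted agreements maximise welfare.

    predict_agreement(opinion, candidate) is a hypothetical stand-in for a
    reward model that scores how much the opinion's author would agree.
    """
    def welfare(candidate: str) -> float:
        scores = [predict_agreement(o, candidate) for o in opinions]
        return isoelastic_welfare(scores, alpha)

    return max(candidates, key=welfare)


# Toy predicted agreement scores (made up for illustration only).
TOY_SCORES = {("a", "x"): 7.0, ("b", "x"): 1.0, ("a", "y"): 4.0, ("b", "y"): 3.0}


def toy_predictor(opinion: str, candidate: str) -> float:
    return TOY_SCORES[(opinion, candidate)]
```

With the toy scores, `rerank(["x", "y"], ["a", "b"], toy_predictor, alpha=0.0)` picks `"x"` (higher mean), while `alpha=math.inf` picks `"y"` (higher minimum), showing how the choice of welfare function changes which statement counts as the best consensus.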

Paper Structure (41 sections, 3 equations, 14 figures, 8 tables)

Introduction
Related Work
Methods
Generating debate questions
Data collection and environment design
Group alignment
Training
Results
Preferences for consensus over baselines
Preferences for model candidates over human opinions
Opinion exclusion analysis
Out-of-distribution generalisation
Discussion
Limitations
Broader Impacts
...and 26 more sections

Figures (14)

Figure 1: Overview of the data collection procedure. The evaluation pipeline proceeded in six steps. (1) Human participants, sorted into small groups ($n \in \{3,4,5\}$), each wrote a short paragraph stating their opinion about a political question (e.g., "should we lower the speed limit on roads?"). (2) These opinions, together with the question, were passed via the prompt to a prompted pre-trained LLM (or, in later rounds, a fine-tuned LLM), which generated consensus candidates. (3) Pairs of participant opinions and candidate consensus statements were passed into a reward model, which estimated the degree to which each participant would agree with a candidate consensus. (4) For each consensus candidate, the set of predicted individual preferences was aggregated with a social welfare function. (5) From a batch of consensus candidates, the one that maximised welfare was selected for human evaluation. (6) Participants then rated this consensus candidate, together with candidates generated in other batches or conditions, on a 7-point agreement scale. Quality ratings were used to filter the data for later fine-tuning and agreement ratings for training the reward model and for evaluation.
Figure 2: Win rates between models, computed by pairwise comparison of Likert agreement ratings for candidate consensus statements (excluding ties), for within-distribution (blue) and out-of-distribution (green) question sets. Likert agreement ratings are aggregated within groups by either the mean (dark bars) or the minimum (light bars) agreement score. A: Win rates for the SFT-Utilitarian model in comparison to baselines. B: Win rates for the SFT-Utilitarian model broken down by whether or not the question was divisive in the group (see main text for details). Error bars represent 95% bootstrapped confidence intervals.
Figure 3: Distributions over Likert ratings for candidate consensus statements generated by the SFT-Utilitarian model and baseline models. A: Agreement ratings. B: Quality ratings. Error-bars represent 95% bootstrapped confidence intervals. See Figure A11 for agreement scores broken down by question divisiveness.
Figure A1: K-means clustering of the debate question embeddings. A: The Silhouette score as a function of the number of clusters. B: The distribution of questions per cluster (total of 2922 questions).
Figure A2: Main task instructions provided to all participants. Additional information (not shown) was provided regarding the task time limits and the button for reporting any offensive or inappropriate content.
...and 9 more figures
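
The win-rate comparison described in the Figure 2 caption (pairwise comparison of agreement ratings, ties excluded, 95% bootstrapped confidence intervals) can be sketched as follows. This is an illustrative sketch rather than the authors' analysis code: it assumes each input list already holds one group-level score per comparison (for example, the mean or minimum Likert rating within a group) and uses a simple percentile bootstrap. The function names `win_rate` and `bootstrap_ci` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def win_rate(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Fraction of non-tied pairs in which model A scores above model B."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    decided = wins + losses
    if decided == 0:
        return float("nan")  # every pair was a tie
    return wins / decided


def bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_boot: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the win rate (assumed method)."""
    rng = random.Random(seed)
    pairs = list(zip(scores_a, scores_b))
    stats: List[float] = []
    for _ in range(n_boot):
        # Resample paired comparisons with replacement.
        sample = [rng.choice(pairs) for _ in pairs]
        rate = win_rate([a for a, _ in sample], [b for _, b in sample])
        if not rate != rate:  # skip NaN resamples that contain only ties
            stats.append(rate)
    if not stats:
        raise ValueError("no resample contained a non-tied pair")
    stats.sort()
    lo_idx = int((1 - level) / 2 * len(stats))
    hi_idx = max(lo_idx, int((1 + level) / 2 * len(stats)) - 1)
    return stats[lo_idx], stats[hi_idx]
```

For example, `win_rate([5, 6, 3, 4], [4, 6, 5, 2])` drops the tied pair and counts two wins out of three decided comparisons.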
