Collective Constitutional AI: Aligning a Language Model with Public Input

Saffron Huang, Divya Siddarth, Liane Lovitt, Thomas I. Liao, Esin Durmus, Alex Tamkin, Deep Ganguli

TL;DR

This work introduces Collective Constitutional AI (CCAI), a four-stage framework that uses Polis-based public input to derive a constitution and then fine-tunes a language model to follow those public principles via Constitutional AI. The authors demonstrate feasibility by training two models, one guided by a Public constitution derived from a representative US sample and one by a Standard constitution, and then evaluating both on language, math, bias, and political-ideology benchmarks. The Public model shows lower bias across nine social dimensions (BBQ) while maintaining comparable performance on MMLU and GSM8K, and qualitative analysis indicates that it tends to reframe contentious topics positively rather than refuse outright. The results suggest that publicly informed constitutional alignment is a tractable approach to more inclusive and legitimate AI systems, though the authors acknowledge limitations in measuring adherence to principles and in the representativeness of the input population.

Abstract

There is growing consensus that language model (LM) developers should not be the sole deciders of LM behavior, creating a need for methods that enable the broader public to collectively shape the behavior of LM systems that affect them. To address this need, we present Collective Constitutional AI (CCAI): a multi-stage process for sourcing and integrating public input into LMs, from identifying a target population to sourcing principles to training and evaluating a model. We demonstrate the real-world practicality of this approach by creating what is, to our knowledge, the first LM fine-tuned with collectively sourced public input and evaluating this model against a baseline model trained with established principles from an LM developer. Our quantitative evaluations demonstrate several benefits of our approach: the CCAI-trained model shows lower bias across nine social dimensions compared to the baseline model, while maintaining equivalent performance on language, math, and helpful-harmless evaluations. Qualitative comparisons suggest that the models differ on the basis of their respective constitutions; for example, when prompted with contentious topics, the CCAI-trained model tends to generate responses that reframe the matter positively rather than refusing. These results demonstrate a promising, tractable pathway toward publicly informed development of language models.

Paper Structure

This paper contains 35 sections, 1 equation, 8 figures, 1 table.

Figures (8)

  • Figure 1: This flowchart captures the stages of the CCAI method and some significant design decisions we made along the way. We hope that explicitly listing these decisions is useful for adapting the CCAI process to different contexts.
  • Figure 2: The most representative statements for each group, based on the relative odds ratio of the probability of a person in group $g$ voting $v$ on a comment, compared to those not in $g$ [small2021polis]. Each statement has three bars: overall votes, Group A votes, and Group B votes. The bars show the proportions of "Agree" (green), "Disagree" (red), and "Pass / Unsure" (grey) votes, with white representing users who didn't see or vote on the statement.
  • Figure 3: (Left) Distribution of group-aware consensus (GAC) across all statements, with the threshold for inclusion marked by a red line. (Right) Distribution of the 'polarization indices'. Polarization tends to be low.
  • Figure 4: BBQ bias scores. In all cases, the Public model achieved a lower bias score than the Standard model.
  • Figure 5: A heatmap of OpinionQA scores showing how well each model reflects different U.S. political ideologies.
  • ...and 3 more figures
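
The representativeness measure named in the Figure 2 caption can be illustrated with a minimal sketch. It reads the caption as a ratio of two vote probabilities: the probability that a member of group $g$ casts vote $v$ on a comment, divided by the same probability for participants outside $g$. The function names, the toy vote lists, and the add-one smoothing are all assumptions made for illustration; the page does not give the exact formula used by the authors or by Polis.

```python
from collections import Counter


def vote_probability(votes, choice):
    """Smoothed probability that a vote in `votes` equals `choice`.

    Assumption: add-one smoothing (+1 in the numerator, +2 in the
    denominator) keeps the estimate away from 0 and 1 for small groups.
    """
    counts = Counter(votes)
    return (counts[choice] + 1) / (len(votes) + 2)


def representativeness(group_votes, other_votes, choice):
    """Ratio of P(choice | in group) to P(choice | not in group).

    Values above 1 mean the group casts `choice` on this comment more
    often than everyone else does.
    """
    return vote_probability(group_votes, choice) / vote_probability(
        other_votes, choice
    )


# Hypothetical votes on a single comment (not data from the paper).
group_a = ["agree", "agree", "agree", "pass"]
others = ["agree", "disagree", "disagree", "disagree"]

ratio = representativeness(group_a, others, "agree")
print(f"Representativeness of 'agree' for group A: {ratio:.2f}")
```

A comment with a high ratio for "agree" in one group is a candidate for that group's most representative statements, which is the kind of ranking the Figure 2 caption describes.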