Democratic Preference Alignment via Sortition-Weighted RLHF

Suvadip Sana; Jinzhou Wu; Martin T. Wells

TL;DR

DemPO addresses demographic bias in RLHF feedback by enforcing representativeness through sortition-based mini-publics. It introduces two training schemes built on a quota-based panel sampling procedure (LEXIMIN): Hard Panel, which restricts updates to preferences from a sampled panel, and Soft Panel, which keeps all raters but reweights each by their inclusion probability under the sortition lottery. Using PRISM data and a constitution elicited from a representative US panel, the authors show that enforcing demographic representativeness at the feedback stage yields improvements across multiple aggregation methods and model sizes, with panel-guided training increasingly advantageous as model capacity grows. The approach offers a practical path toward aligning AI systems with broader public values, while acknowledging limitations around data, locales, and judge biases.

Abstract

Whose values should AI systems learn? Preference-based alignment methods like RLHF derive their training signal from human raters, yet these rater pools are typically convenience samples that systematically overrepresent some demographics and underrepresent others. We introduce Democratic Preference Optimization (DemPO), a framework that applies algorithmic sortition, the same mechanism used to construct citizen assemblies, to preference-based fine-tuning. DemPO offers two training schemes. Hard Panel trains exclusively on preferences from a quota-satisfying mini-public sampled via sortition. Soft Panel retains all data but reweights each rater by their inclusion probability under the sortition lottery. We prove that Soft Panel weighting recovers the expected Hard Panel objective in closed form. Using a public preference dataset that pairs human judgments with rater demographics and a 75-clause constitution independently elicited from a representative United States panel, we evaluate Llama models from 1B to 8B parameters fine-tuned under each scheme. Across six aggregation methods, the Hard Panel consistently ranks first and the Soft Panel consistently outperforms the unweighted baseline, with effect sizes growing as model capacity increases. These results demonstrate that enforcing demographic representativeness at the preference collection stage, rather than through post hoc correction, yields models whose behavior better reflects values elicited from representative publics.

Paper Structure (43 sections, 2 theorems, 23 equations, 5 figures, 14 tables)

Introduction
Our Contributions
Related Work
Preference-based alignment and RLHF.
Whose preferences and social choice perspectives.
Democratic and diverse feedback collection.
Methodology
Population, pool, and preference data.
Multi-turn contexts and pair construction (PRISM instantiation).
Sortition over panels.
Hard Panel Training
Soft Panel Weighting
Experiments
Data: PRISM Preferences and Demographics
Sortition Targets and Panel Configuration
...and 28 more sections

Key Result

Lemma 1.1

Let $S \sim \pi_{\mathrm{panel}}$ be a random panel of fixed size $|S|=k$, and let $\pi_i = \Pr[i\in S]$ be the inclusion probability of rater $i$. Then $\sum_{i\in\mathcal{I}} \pi_i = k$.
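
The lemma follows from linearity of expectation: $\sum_i \pi_i = \mathbb{E}\big[\sum_i \mathbf{1}\{i \in S\}\big] = \mathbb{E}[|S|] = k$. A minimal numerical check is sketched below. The five-rater pool, the panel size, and the non-uniform lottery over panels are all hypothetical values chosen for illustration; they are not the LEXIMIN distribution or the PRISM pool used in the paper.

```python
from fractions import Fraction
from itertools import combinations


def inclusion_probabilities(panel_dist, pool):
    """Return pi_i = Pr[i in S] for every rater i, given Pr[S] for each panel S."""
    pi = {i: Fraction(0) for i in pool}
    for panel, prob in panel_dist.items():
        for i in panel:
            pi[i] += prob
    return pi


# Hypothetical pool of five raters and a fixed panel size k = 2.
pool = ["a", "b", "c", "d", "e"]
k = 2
panels = list(combinations(pool, k))  # all 10 size-2 panels

# An arbitrary non-uniform lottery: panel j gets probability proportional to j.
weights = range(1, len(panels) + 1)
total = sum(weights)
panel_dist = {p: Fraction(w, total) for p, w in zip(panels, weights)}

pi = inclusion_probabilities(panel_dist, pool)

# The inclusion probabilities sum to the panel size exactly,
# whatever the lottery, as long as every panel has size k.
assert sum(pi.values()) == k
```

Exact rational arithmetic (`Fraction`) is used so the identity holds without floating-point tolerance. The check does not depend on the particular lottery, only on every panel having the same size.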

Figures (5)

Figure 1: The DemPO pipeline for democratic preference alignment. A biased, self-selected pool of data labelers is transformed into a demographically representative mini-public via algorithmic sortition subject to population-derived quota constraints. Preferences from this representative panel (Hard Panel) or selection-probability-weighted preferences from all raters (Soft Panel) are then used for RLHF training, yielding AI systems aligned with broader public values.
Figure 2: Model ranking under multiple aggregation methods (Llama-3.1-8B). Left: Borda and Copeland scores with 95% bootstrap confidence intervals, and Kemeny consensus summarized as rank-position probabilities under bootstrap resampling. Right: Bradley--Terry and Plackett--Luce log-ability scores with 95% bootstrap confidence intervals, and Mallows (Kendall) rank-position probabilities under bootstrap resampling (with fitted $\phi$ and held-out $\ell_{\text{test}}$). All bootstrap summaries use $n{=}1000$ resamples. Across BT, Borda, Copeland, Kemeny, and Mallows, Hard Panel ranks highest, US-Rep ranks second, Soft Panel ranks above the Full PRISM baseline, and the Base model is consistently worst.
Figure 3: Panel advantage grows with model size. Effect sizes are computed from bootstrap resampling of listwise rankings (Borda average score differences). The Soft Panel vs. Full PRISM gap and the Hard Panel vs. US-Rep gap both increase from 1B to 3B to 8B. All error bars are 95% bootstrap CIs ($n{=}1000$).
Figure 4: Score grids for Llama-3.2-1B (left) and Llama-3.2-3B (right) under the same evaluation protocol.
Figure 5: Normalized-step score grid for Llama-3.1-8B, where Hard Panel and US-Rep are retrained for fewer steps (Hard: $2200$, US-Rep: $2100$) to match the update magnitude of Full/Soft.
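
The captions for Figures 2 and 3 report 95% bootstrap confidence intervals over Borda scores and Borda score differences with $n{=}1000$ resamples. One way such an interval could be computed is sketched below. It assumes a percentile bootstrap over listwise rankings, which the captions do not specify. The model names `A`, `B`, `C` and the rankings are synthetic placeholders, not results from the paper.

```python
import random


def borda_scores(rankings, models):
    """Mean Borda score per model; each ranking lists models best-first."""
    m = len(models)
    totals = {name: 0.0 for name in models}
    for ranking in rankings:
        for position, name in enumerate(ranking):
            totals[name] += m - 1 - position  # best gets m-1 points, worst gets 0
    return {name: totals[name] / len(rankings) for name in models}


def bootstrap_borda_gap(rankings, models, a, b, n_resamples=1000, seed=0):
    """Percentile 95% interval for the mean Borda score of a minus that of b."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_resamples):
        # Resample whole rankings with replacement.
        sample = rng.choices(rankings, k=len(rankings))
        scores = borda_scores(sample, models)
        gaps.append(scores[a] - scores[b])
    gaps.sort()
    lower = gaps[int(0.025 * n_resamples)]
    upper = gaps[int(0.975 * n_resamples) - 1]
    return lower, upper


# Synthetic listwise rankings over three placeholder models.
models = ["A", "B", "C"]
rankings = (
    [("A", "B", "C")] * 60
    + [("B", "A", "C")] * 25
    + [("C", "B", "A")] * 15
)

point = borda_scores(rankings, models)
lower, upper = bootstrap_borda_gap(rankings, models, "A", "B")
print(f"gap A-B: {point['A'] - point['B']:.2f}, 95% CI: [{lower:.2f}, {upper:.2f}]")
```

Resampling whole rankings, rather than individual pairwise comparisons, keeps each judgment's internal ordering intact, which matches the captions' description of bootstrapping listwise rankings.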

Theorems & Definitions (4)

Lemma 1.1: Inclusion probabilities sum to the panel size
proof
Lemma 1.2: Soft weights recover the expected hard-panel objective
proof
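
Lemma 1.2 can be illustrated numerically. The sketch below assumes the Hard Panel objective is the average of per-rater losses over the sampled panel, so that its expectation over panels equals the inclusion-probability-weighted loss over the full pool, normalized by $k$. That decomposition and normalization are assumptions made for illustration; the paper's exact objective is not reproduced here. The pool, losses, and lottery are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical pool, panel size, and per-rater losses at a fixed parameter value.
pool = ["a", "b", "c", "d"]
k = 2
loss = {
    "a": Fraction(3, 10),
    "b": Fraction(7, 10),
    "c": Fraction(1, 2),
    "d": Fraction(9, 10),
}

# An arbitrary non-uniform lottery over the 6 size-2 panels.
panels = list(combinations(pool, k))
weights = [1, 2, 3, 4, 5, 6]
total = sum(weights)
panel_dist = {p: Fraction(w, total) for p, w in zip(panels, weights)}


def hard_panel_loss(panel):
    """Average loss over the raters in one sampled panel (assumed form)."""
    return sum(loss[i] for i in panel) / k


# Expected Hard Panel objective: average over the panel lottery.
expected_hard = sum(prob * hard_panel_loss(p) for p, prob in panel_dist.items())

# Soft Panel objective: every rater kept, weighted by inclusion probability.
pi = {i: sum(prob for p, prob in panel_dist.items() if i in p) for i in pool}
soft = sum(pi[i] * loss[i] for i in pool) / k

# The two agree exactly, by linearity of expectation.
assert expected_hard == soft
```

Because the identity is linear in the per-rater losses, the same argument carries over to gradients: under these assumptions, a Soft Panel update matches the average Hard Panel update over repeated panel draws.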
