DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding

Mingxi Zou; Jiaxiang Chen; Junfan Li; Langzhang Liang; Qifan Wang; Xu Yinghui; Zenglin Xu

TL;DR

DARC is a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. It provides simple deployment controls that cap or penalize the entropic risk premium relative to the mean, enabling explicit risk budgets.

Abstract

Preference-based alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective, implicitly averaging over heterogeneous human preferences. In practice, systematic annotator and user-group disagreement makes mean-reward maximization brittle and susceptible to proxy over-optimization. We propose **Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC)**, a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. Given multiple preference samples or scalable disagreement proxies, DARC reranks candidates by maximizing a *KL-robust (entropic)* satisfaction objective, and provides simple deployment controls that cap or penalize the corresponding entropic risk premium relative to the mean, enabling explicit risk budgets without retraining. We provide a theoretical characterization linking this decoding rule to principled pessimism and KL-based distributionally robust optimization. Experiments on alignment benchmarks show that DARC reduces disagreement and tail risk while maintaining competitive average quality under noisy, heterogeneous feedback.
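
The reranking rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`entropic_value`, `risk_premium`, `darc_select`), the parameters `beta`, `epsilon`, and `lam`, the fallback when no candidate meets the cap, and the toy scores are all assumptions made for illustration. The sketch assumes each candidate comes with a list of scalar satisfaction samples, scores it by the entropic value $-\tfrac{1}{\beta}\log \tfrac{1}{n}\sum_i e^{-\beta s_i}$, and defines the risk premium as the sample mean minus that value.

```python
import math


def entropic_value(samples, beta):
    """Entropic value -(1/beta) * log(mean(exp(-beta * s))), computed stably."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    m = min(samples)  # shift by the minimum so every exponent is <= 0
    acc = sum(math.exp(-beta * (s - m)) for s in samples) / len(samples)
    return m - math.log(acc) / beta


def risk_premium(samples, beta):
    """Sample mean minus entropic value; clamped at 0 against rounding error."""
    mean = sum(samples) / len(samples)
    return max(0.0, mean - entropic_value(samples, beta))


def darc_select(candidates, beta=1.0, epsilon=None, lam=None):
    """Pick a candidate name from {name: [satisfaction samples]}.

    Assumed modes (hypothetical interface):
      epsilon set -> maximize the mean subject to risk premium <= epsilon
      lam set     -> maximize mean - lam * risk premium
      neither     -> maximize the entropic value itself
    """
    def mean(xs):
        return sum(xs) / len(xs)

    if epsilon is not None:
        feasible = [n for n, s in candidates.items()
                    if risk_premium(s, beta) <= epsilon]
        if not feasible:
            # Assumed fallback: take the least risky candidate.
            return min(candidates,
                       key=lambda n: risk_premium(candidates[n], beta))
        return max(feasible, key=lambda n: mean(candidates[n]))
    if lam is not None:
        return max(candidates,
                   key=lambda n: mean(candidates[n])
                   - lam * risk_premium(candidates[n], beta))
    return max(candidates, key=lambda n: entropic_value(candidates[n], beta))


# Toy example: "divisive" has the higher mean, but one annotator strongly dislikes it.
toy = {"safe": [0.6, 0.6, 0.6, 0.6], "divisive": [1.0, 1.0, 1.0, 0.0]}
print(darc_select(toy, beta=5.0))               # entropic objective
print(darc_select(toy, beta=5.0, epsilon=0.1))  # capped risk premium
print(darc_select(toy, beta=5.0, lam=0.0))      # plain mean maximization
```

In this toy setting the risk-sensitive modes prefer the candidate on which annotators agree, while setting `lam=0.0` recovers plain mean maximization. A larger `beta` weights low scores more heavily.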

Paper Structure (187 sections, 16 theorems, 127 equations, 6 figures, 15 tables, 1 algorithm)

Introduction
Contributions.
Problem setup
Conditioning on the candidate set.
Latent satisfaction under heterogeneous preferences.
KL-robust (entropic) value and risk premium.
Decision problem (risk-aware decoding).
Guarantees via Lower Confidence Bounds
Scalar satisfaction samples (guarantee setting).
Estimation risk: uniform LCB and a mean--dispersion surrogate
Bridge to pairwise preferences.
Lower-tail interpretation.
On constants and practical calibration.
LCB decoding and a mean--dispersion surrogate.
Distributional risk: DRO characterizations of pessimistic value
...and 172 more sections

Key Result

Proposition 3.3

There exists an absolute constant $c>0$ such that for any $\delta\in(0,1)$, with probability at least $1-\delta$, simultaneously for all $y\in\mathcal{Y}(s)$, We denote the right-hand side by $\mathrm{LCB}_\delta(y)$.
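
The displayed inequality is not reproduced in the statement above. For orientation only, a uniform lower confidence bound of empirical-Bernstein type usually takes the following shape. This is an assumed generic form, not the paper's exact statement: the symbols $\mu(y)$ (true mean satisfaction), $\hat\mu_n(y)$ and $\hat\sigma_n(y)$ (sample mean and sample standard deviation over $n$ ratings), and the boundedness of ratings in $[0,1]$ are illustrative assumptions, and the bounded quantity and constants in the paper may differ.

```latex
% Assumed generic form (empirical Bernstein plus a union bound over candidates):
\mu(y) \;\ge\; \hat\mu_n(y)
  \;-\; c\left(
      \hat\sigma_n(y)\,\sqrt{\frac{\log\!\big(|\mathcal{Y}(s)|/\delta\big)}{n}}
      \;+\; \frac{\log\!\big(|\mathcal{Y}(s)|/\delta\big)}{n}
    \right)
  \qquad \text{for all } y \in \mathcal{Y}(s).
```

In a bound of this shape the leading penalty scales with the empirical dispersion $\hat\sigma_n(y)$, so candidates on which raters disagree receive a more pessimistic value at the same sample size.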

Figures (6)

Figure 1: Score distribution shift. Ridge plot of human score densities on the high-disagreement subset. Compared to the baseline (grey), DARC variants (blue) shift the distribution to the right (higher mean $\mu$) and reduce its spread (lower $\sigma$), indicating both increased satisfaction and reduced disagreement.
Figure 2: Ablation Studies. Impact of key hyperparameters on risk mitigation performance. (a) Candidate pool size $K$. (b) Risk sensitivity coefficient $\beta$. (c) Constraint threshold $\epsilon$. (d) Perturbation budget $N_{\text{aug}}$.
Figure 3: Gains concentrate on high-disagreement prompts. Mean improvement in lower-tail satisfaction ($\Delta$Tradeoff vs. base) across five prompt buckets ranked by baseline human disagreement $\hat{\sigma}$ (low$\rightarrow$high). Error bars denote 95% CIs.
Figure 4: Proxy validity diagnostics. (Top) Rank correlation between proxy and human disagreement, with top-20% overlap. (Bottom) Top-$q$ overlap (Left) and proxy vs. human disagreement scatter (Right).
Figure 5: Conservative metric exhibits the same bucketed trend. Bucketed improvements (vs. base) for a conservative CVaR-style score (e.g., $\Delta \mathrm{CVaR}_{10}$), using the same human-disagreement buckets as Fig. \ref{fig:bucket_dcvar}. Bars show mean; error bars denote 95% CIs.
...and 1 more figure

Theorems & Definitions (28)

Remark 3.1: Shared annotators across candidates
Proposition 3.3: Uniform LCB
Remark 3.4: Variance governs estimation hardness
Theorem 3.5: KL-robust value equals an entropic objective
Proposition 3.6: $\chi^2$-DRO robust mean admits a mean--dispersion form
Proposition 4.1: Scorer aggregation as KL-regularized DRO over scorers
Lemma 1.1: Empirical Bernstein bound (bounded case)
Corollary 1.2: Mean--dispersion surrogate form under bounded ratings
proof
Remark 1.3: Standard deviation vs. Variance penalization
...and 18 more
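
The title of Theorem 3.5 points to a standard identity, sketched here in assumed notation rather than the paper's own: $S$ is the satisfaction score, $P$ the nominal distribution over raters, $\beta>0$ the risk sensitivity, and $Q$ an adversarially reweighted distribution. By the Donsker-Varadhan variational formula, a KL-penalized worst-case mean equals the entropic value, and a second-order expansion for small $\beta$ gives a mean minus variance form.

```latex
% KL-penalized robust value equals the entropic objective (Donsker-Varadhan):
\inf_{Q \ll P}\left\{ \mathbb{E}_{Q}[S] + \frac{1}{\beta}\,\mathrm{KL}(Q\,\|\,P) \right\}
  \;=\; -\frac{1}{\beta}\,\log \mathbb{E}_{P}\!\left[e^{-\beta S}\right].

% Small-beta expansion for bounded S (cumulant expansion):
-\frac{1}{\beta}\,\log \mathbb{E}_{P}\!\left[e^{-\beta S}\right]
  \;=\; \mathbb{E}_{P}[S] \;-\; \frac{\beta}{2}\,\mathrm{Var}_{P}(S) \;+\; O(\beta^{2}).
```

Under this reading, the entropic risk premium is the gap $\mathbb{E}_P[S] - \big(-\tfrac{1}{\beta}\log \mathbb{E}_P[e^{-\beta S}]\big)$, which is nonnegative by Jensen's inequality and approximately $\tfrac{\beta}{2}\mathrm{Var}_P(S)$ when $\beta$ is small.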
