DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding

Mingxi Zou, Jiaxiang Chen, Junfan Li, Langzhang Liang, Qifan Wang, Xu Yinghui, Zenglin Xu

TL;DR

DARC is a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. It provides simple deployment controls that cap or penalize the entropic risk premium relative to the mean, enabling explicit risk budgets.

Abstract

Preference-based alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective, implicitly averaging over heterogeneous human preferences. In practice, systematic annotator and user-group disagreement makes mean-reward maximization brittle and susceptible to proxy over-optimization. We propose **Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC)**, a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. Given multiple preference samples or scalable disagreement proxies, DARC reranks candidates by maximizing a *KL-robust (entropic)* satisfaction objective, and provides simple deployment controls that cap or penalize the corresponding entropic risk premium relative to the mean, enabling explicit risk budgets without retraining. We provide a theoretical characterization linking this decoding rule to principled pessimism and KL-based distributionally robust optimization. Experiments on alignment benchmarks show that DARC reduces disagreement and tail risk while maintaining competitive average quality under noisy, heterogeneous feedback.
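The reranking idea in the abstract can be illustrated with a minimal sketch. It assumes each candidate response comes with a small list of satisfaction scores (for example, one per annotator or proxy scorer) and uses the standard entropic form $-\tfrac{1}{\beta}\log \mathbb{E}[e^{-\beta s}]$ as the KL-robust value. The function names, the three `mode` options, and the default values of `beta`, `epsilon`, and `lam` are hypothetical choices for illustration, not the authors' implementation.

```python
import math


def entropic_value(scores, beta):
    """Entropic value -(1/beta) * log(mean(exp(-beta * s))), via a stable log-sum-exp."""
    shifted = [-beta * s for s in scores]
    m = max(shifted)
    log_mean_exp = m + math.log(sum(math.exp(x - m) for x in shifted) / len(scores))
    return -log_mean_exp / beta


def risk_premium(scores, beta):
    """Mean minus entropic value; non-negative by Jensen's inequality."""
    return sum(scores) / len(scores) - entropic_value(scores, beta)


def select(candidates, beta=2.0, mode="entropic", epsilon=0.05, lam=1.0):
    """Pick one candidate from a dict {name: [scores]}.

    mode="entropic": maximize the entropic value directly.
    mode="cap":      maximize the mean among candidates whose premium is <= epsilon
                     (assumed fallback: lowest premium if none qualifies).
    mode="penalty":  maximize mean - lam * premium.
    """
    def mean(y):
        return sum(candidates[y]) / len(candidates[y])

    def premium(y):
        return risk_premium(candidates[y], beta)

    if mode == "entropic":
        return max(candidates, key=lambda y: entropic_value(candidates[y], beta))
    if mode == "cap":
        feasible = [y for y in candidates if premium(y) <= epsilon]
        if not feasible:
            return min(candidates, key=premium)
        return max(feasible, key=mean)
    if mode == "penalty":
        return max(candidates, key=lambda y: mean(y) - lam * premium(y))
    raise ValueError(f"unknown mode: {mode}")


# Toy data (made up): "divisive" has the higher mean but one very unhappy rater.
toy = {
    "safe": [0.7, 0.7, 0.7, 0.7],
    "divisive": [1.0, 1.0, 1.0, 0.0],
}
print(select(toy, mode="entropic"))
print(select(toy, mode="penalty", lam=0.0))
```

On this toy input, the entropic rule prefers `safe` even though `divisive` has the higher mean, while setting `lam=0.0` in penalty mode reduces to plain mean maximization and picks `divisive`.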

Paper Structure (187 sections, 16 theorems, 127 equations, 6 figures, 15 tables, 1 algorithm)

Key Result

Proposition 3.3

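There exists an absolute constant $c>0$ such that for any $\delta\in(0,1)$, with probability at least $1-\delta$, simultaneously for all $y\in\mathcal{Y}(s)$, a lower confidence bound on the mean satisfaction of $y$ holds (the displayed inequality is not reproduced here). We denote the right-hand side of that bound by $\mathrm{LCB}_\delta(y)$.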
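The inequality itself does not appear in this summary. As a rough illustration of what a uniform lower confidence bound of this kind can look like, the sketch below uses the standard empirical Bernstein bound of Maurer and Pontil for ratings in $[0,1]$, with a union bound over candidates. The constants are those of that standard bound and need not match the paper's constant $c$. The function names and the rating data are hypothetical.

```python
import math
from statistics import mean, variance


def empirical_bernstein_lcb(ratings, delta):
    """Lower confidence bound on the true mean of ratings in [0, 1].

    Uses the Maurer-Pontil empirical Bernstein bound:
    mean - sqrt(2 * V * ln(2/delta) / n) - 7 * ln(2/delta) / (3 * (n - 1)),
    where V is the unbiased sample variance.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings")
    log_term = math.log(2.0 / delta)
    spread = math.sqrt(2.0 * variance(ratings) * log_term / n)
    return mean(ratings) - spread - 7.0 * log_term / (3.0 * (n - 1))


def uniform_lcbs(candidates, delta):
    """LCBs that hold simultaneously for all candidates (union bound: delta / K each)."""
    k = len(candidates)
    return {y: empirical_bernstein_lcb(r, delta / k) for y, r in candidates.items()}


# Toy data (made up): same mean rating, different annotator disagreement.
ratings = {
    "consensus": [0.8] * 50,
    "contested": [0.6, 1.0] * 25,
}
for name, lcb in uniform_lcbs(ratings, delta=0.05).items():
    print(name, round(lcb, 3))
```

Both toy candidates have the same empirical mean, but the higher-variance one receives a lower bound, consistent with Remark 3.4's point that variance governs estimation hardness. The union bound holds regardless of dependence between candidates, so it remains valid when annotators are shared.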

Figures (6)

  • Figure 1: Score distribution shift. Ridge plot showing human score densities on the high-disagreement subset. DARC variants (blue) shift the distribution to the right (higher mean $\mu$) compared to the baseline (grey), with reduced spread (lower $\sigma$), indicating both increased satisfaction and reduced disagreement.
  • Figure 2: Ablation Studies. Impact of key hyperparameters on risk mitigation performance. (a) Candidate pool size $K$. (b) Risk sensitivity coefficient $\beta$. (c) Constraint threshold $\epsilon$. (d) Perturbation budget $N_{\text{aug}}$.
  • Figure 3: Gains concentrate on high-disagreement prompts. Mean improvement in lower-tail satisfaction ($\Delta$Tradeoff vs. base) across five prompt buckets ranked by baseline human disagreement $\hat{\sigma}$ (low$\rightarrow$high). Error bars denote 95% CIs.
  • Figure 4: Proxy validity diagnostics. (Top) Rank correlation between proxy and human disagreement, with top-20% overlap. (Bottom) Top-$q$ overlap (Left) and proxy vs. human disagreement scatter (Right).
  • Figure 5: Conservative metric exhibits the same bucketed trend. Bucketed improvements (vs. base) for a conservative CVaR-style score (e.g., $\Delta \mathrm{CVaR}_{10}$), using the same human-disagreement buckets as Fig. \ref{fig:bucket_dcvar}. Bars show mean; error bars denote 95% CIs.
  • ...and 1 more figure

Theorems & Definitions (28)

  • Remark 3.1: Shared annotators across candidates
  • Proposition 3.3: Uniform LCB
  • Remark 3.4: Variance governs estimation hardness
  • Theorem 3.5: KL-robust value equals an entropic objective
  • Proposition 3.6: $\chi^2$-DRO robust mean admits a mean-dispersion form
  • Proposition 4.1: Scorer aggregation as KL-regularized DRO over scorers
  • Lemma 1.1: Empirical Bernstein bound (bounded case)
  • Corollary 1.2: Mean-dispersion surrogate form under bounded ratings
  • proof
  • Remark 1.3: Standard deviation vs. variance penalization
  • ...and 18 more
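
As background for Theorem 3.5 in the list above, the standard duality between KL-robust values and entropic objectives can be written as follows. This is the textbook Donsker-Varadhan form for a satisfaction score $S\sim P$ with finite exponential moments and a risk parameter $\beta>0$. The paper's exact statement and notation may differ, and the symbols $S$, $P$, $Q$, and $\rho$ are assumptions introduced here.

```latex
% Penalized form: worst-case mean with a KL penalty equals the entropic value.
\inf_{Q \ll P} \Big\{ \mathbb{E}_Q[S] + \tfrac{1}{\beta}\,\mathrm{KL}(Q \,\|\, P) \Big\}
  \;=\; -\tfrac{1}{\beta} \log \mathbb{E}_P\big[e^{-\beta S}\big]

% Ball-constrained form: worst-case mean over a KL ball of radius \rho.
\inf_{Q:\ \mathrm{KL}(Q \| P) \le \rho} \mathbb{E}_Q[S]
  \;=\; \sup_{\beta > 0} \Big\{ -\tfrac{1}{\beta} \log \mathbb{E}_P\big[e^{-\beta S}\big] - \tfrac{\rho}{\beta} \Big\}

% Entropic risk premium relative to the mean (non-negative by Jensen's inequality).
\mathbb{E}_P[S] - \Big( -\tfrac{1}{\beta} \log \mathbb{E}_P\big[e^{-\beta S}\big] \Big) \;\ge\; 0
```

The third line is the quantity that a cap or penalty on the "entropic risk premium relative to the mean" would act on: it is zero when all raters agree and grows as scores disperse.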