Leveraging Robust Optimization for LLM Alignment under Distribution Shifts

Mingye Zhu, Yi Liu, Zheren Fu, Yongdong Zhang, Zhendong Mao

TL;DR

The paper addresses distribution shifts in LLM preference alignment caused by synthetic data, introducing DoRA, a distribution-aware robust optimization framework. DoRA combines per-sample calibration from probabilistic classifiers with a KL-DRO objective to emphasize data regions aligned with the target human distribution while downweighting misaligned samples; the method yields a tractable dual form and is designed as a modular plug-in. The authors formalize the mixture response shift $P(y|x)=\alpha Q_0(y|x)+\sum_i\beta_i Q_i(y|x)$ and derive the DoRA objective, including a calibrated weighting term $\tilde{h}(\mathbf{z})$, enabling principled robustness control. Empirically, DoRA improves alignment performance across pairwise and listwise settings, enhances reward-confidence correlation, and demonstrates robustness to corruption, label noise, and online adaptation, suggesting broad applicability to real-world alignment challenges.

Abstract

Preference alignment methods are increasingly critical for steering large language models (LLMs) to generate outputs consistent with human values. While recent approaches often rely on synthetic data generated by LLMs for scalability and cost-efficiency reasons, this reliance can introduce distribution shifts that undermine the nuanced representation of human preferences needed for desirable outputs. In this paper, we propose a novel distribution-aware optimization framework that improves preference alignment despite such shifts. Our approach first leverages well-learned classifiers to assign a calibration value to each training sample, quantifying its alignment with the target human-preferred distribution. These values are then incorporated into a robust optimization objective that minimizes the worst-case loss over regions of the data space most relevant to human preferences. By explicitly focusing optimization on the target distribution, our approach mitigates the impact of distributional mismatch and improves the generation of responses that better reflect intended values.
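The exact DoRA objective is not reproduced in this summary, so the following is only a minimal sketch of the general idea under stated assumptions. It uses the standard dual of KL-regularized DRO, $\lambda \log \mathbb{E}_P[\exp(\ell/\lambda)]$, and a hypothetical calibrated variant that multiplies each exponentiated loss by a per-sample weight `h`. If `h` approximates the ratio $Q_0/P$, the weighted expectation becomes an expectation under the target $Q_0$. The function names, the self-normalization of `h`, and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def kl_dro_dual(losses, lam):
    """Standard KL-DRO dual: lam * log(mean(exp(losses / lam))), computed stably."""
    losses = np.asarray(losses, dtype=float)
    scaled = losses / lam
    m = scaled.max()  # subtract the max before exponentiating to avoid overflow
    return lam * (m + np.log(np.mean(np.exp(scaled - m))))

def calibrated_kl_dro(losses, h, lam):
    """Hypothetical calibrated variant: weight each sample by h before the log-mean-exp.

    h is assumed to be a non-negative per-sample calibration score; it is
    normalized to mean 1 here (an assumption made for this sketch).
    """
    losses = np.asarray(losses, dtype=float)
    h = np.asarray(h, dtype=float)
    w = h / h.mean()
    scaled = losses / lam
    m = scaled.max()
    return lam * (m + np.log(np.mean(w * np.exp(scaled - m))))

# Toy per-sample losses and calibration scores (made up for illustration only).
losses = np.array([0.2, 1.5, 0.7, 3.0])
h = np.array([0.9, 0.1, 0.8, 0.05])  # high-loss samples look less "human-like" here

vanilla = kl_dro_dual(losses, lam=1.0)
calibrated = calibrated_kl_dro(losses, h, lam=1.0)
print(vanilla, calibrated)
```

In this sketch, a large `lam` makes the vanilla objective approach the plain average loss, while a small `lam` makes it approach the maximum loss. With the toy scores above, the calibrated variant is smaller than the vanilla one because the high-loss samples receive low weight.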

Paper Structure

This paper contains 32 sections, 2 theorems, 35 equations, 7 figures, 9 tables, 1 algorithm.

Key Result

Proposition 3.1

Let $P(y \mid x) = \alpha \, Q_0(y \mid x) + \sum_{i=1}^{n-1} \beta_i \, Q_i(y \mid x)$, with $\alpha + \beta_1 + \cdots + \beta_{n-1} = 1$ and $\alpha \in (0,1)$. Under the mixture response shift, we define $\tilde{h}(\mathbf{z})$ as an empirical estimate of the degree to which a given sample aligns with human preferences, where $w_{\phi_i}$ is defined earlier in the paper (equation label eq:imp_wei).
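The mixture form alone implies a simple identity that may help in reading the proposition: dividing $P$ by $Q_0$ gives $Q_0/P = 1 / (\alpha + \sum_i \beta_i \, Q_i/Q_0)$. The sketch below checks this numerically on a toy discrete example. It assumes exact density ratios $Q_0/Q_i$ are available (here called `ratios`); whether these coincide with the paper's $w_{\phi_i}$, and whether the paper's $\tilde{h}(\mathbf{z})$ takes this form, is not confirmed by this summary. All names and numbers are hypothetical.

```python
import numpy as np

def target_to_mixture_ratio(alpha, betas, ratios):
    """Return Q0/P for P = alpha*Q0 + sum_i betas[i]*Q_i.

    ratios[i] holds the density ratio Q0/Q_i evaluated at each outcome
    (assumed known exactly in this sketch).
    """
    betas = np.asarray(betas, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    # Q0/P = 1 / (alpha + sum_i beta_i * Q_i/Q0) = 1 / (alpha + sum_i beta_i / ratios[i])
    return 1.0 / (alpha + np.sum(betas[:, None] / ratios, axis=0))

# Toy discrete distributions over four outcomes (illustrative values only).
q0 = np.array([0.4, 0.3, 0.2, 0.1])      # target (human-preferred) distribution
q1 = np.array([0.1, 0.2, 0.3, 0.4])      # first shifted component
q2 = np.array([0.25, 0.25, 0.25, 0.25])  # second shifted component
alpha, betas = 0.5, [0.3, 0.2]

p = alpha * q0 + betas[0] * q1 + betas[1] * q2  # training mixture
ratios = np.stack([q0 / q1, q0 / q2])
h = target_to_mixture_ratio(alpha, betas, ratios)
print(h)
```

Because every term in the denominator is non-negative, the ratio is bounded above by $1/\alpha$, so reweighting by it cannot blow up as long as the target component has non-negligible mass in the mixture.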

Figures (7)

  • Figure 1: Comparison of ERM and DoRA training. The left section illustrates the training distribution $P$, a mixture of the human-preferred (target) distribution $Q_0$ and the LLM distribution $Q_1$, highlighting the mixture response shift. The right section contrasts the outcome of the traditional method (ERM training over $P$) with that of the proposed DoRA training over $Q_0$, demonstrating how DoRA better aligns with the target distribution.
  • Figure 2: The DoRA pipeline. For each datum $\mathbf{z}$, where responses are drawn from a mixture of distributions, DoRA uses trained classifiers to estimate the alignment of each $y$ with the target distribution. These scores are then aggregated into a calibration term $\tilde{h}(\mathbf{z})$ for each sample, which reweights the original loss $\ell(\mathbf{z})$ during optimization to enable more principled robustness control.
  • Figure 3: DoRA boosts pairwise baseline performance. When applied to unaugmented "golden" preference datasets, DoRA consistently enhances response quality across all baselines.
  • Figure 4: Reward-confidence correlation for generated responses. DoRA exhibits stronger reward-confidence calibration than the baselines, as shown by steeper regression slopes ($\Delta \beta = +0.174$ for Mistral and $\Delta \beta = +0.131$ for Llama; a larger slope indicates stronger correlation). This indicates that DoRA's high-reward outputs more closely match the characteristics of the target distribution, as reflected in higher classifier probabilities.
  • Figure 5: Performance variation with different choices of $\lambda$ for vanilla DRO and DoRA. As $\lambda$ increases from 0.5 to 4.0, the win rate generally decreases, with some fluctuation. Vanilla DRO also generally underperforms the proposed DoRA.
  • ...and 2 more figures

Theorems & Definitions (6)

  • Definition 2.1: Mixture Response Shift
  • Remark 1
  • Proposition 3.1
  • Remark 2
  • Proposition 3.2: Worst-case risk under mixture response shift
  • Remark 3