Doubly Robust Alignment for Large Language Models

Erhan Xu, Kai Ye, Hongyi Zhou, Luhan Zhu, Francesco Quinzan, Chengchun Shi

TL;DR

This work targets robust RLHF for aligning large language models by addressing misspecifications in the BT-based preference model and reference policy. It introduces a doubly robust estimator for total policy preference $p^*(\pi)$ and a corresponding doubly robust preference optimization (DRPO) that remains consistent if either the preference model or the reference policy is correct, plus finite-sample and semi-parametric efficiency guarantees. The paper provides rigorous theory showing efficiency bounds and gap analyses, and demonstrates empirically that DRPO and its evaluation estimator outperform PPO and DPO across synthetic and real-world tasks under model misspecification. The approach yields practical improvements and opens a path toward safer, more robust LLM alignment, with code available publicly for reproducibility.

Abstract

This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at https://github.com/DRPO4LLM/DRPO4LLM

Paper Structure

This paper contains 21 sections, 8 theorems, 82 equations, 6 figures, 14 tables, 1 algorithm.

Key Result

Lemma 1

Assume $w(y,x)<\infty$ for any $x$, $y$. Then $p^*(\pi)=\frac{1}{2}\mathbb{E} [w(Y^{(1)},X)Z+ w(Y^{(2)},X)(1-Z)]$.
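
The identity in Lemma 1 can be checked exactly on a toy problem. The sketch below rests on assumptions that are not stated in this excerpt: that $w(y,x)=\pi(y\mid x)/\pi_{\textrm{ref}}(y\mid x)$ is an importance weight, that $p^*(\pi)$ is the probability that a response drawn from $\pi$ is preferred to one drawn from $\pi_{\textrm{ref}}$, that $Y^{(1)}$ and $Y^{(2)}$ are drawn independently from $\pi_{\textrm{ref}}$, and that $Z$ indicates that $Y^{(1)}$ is preferred. It uses a single context and three candidate responses, and all numbers and function names are hypothetical.

```python
from itertools import product

# Toy setup (all values hypothetical): one context, three candidate responses.
pi_ref = [0.5, 0.3, 0.2]   # reference policy, assumed to generate both responses
pi = [0.2, 0.3, 0.5]       # target policy being evaluated

# g[i][j] = P(response i is preferred over response j), with g[i][j] + g[j][i] = 1.
g = [
    [0.5, 0.3, 0.1],
    [0.7, 0.5, 0.4],
    [0.9, 0.6, 0.5],
]


def total_preference(pi, pi_ref, g):
    """Assumed definition of p*(pi): P(Y ~ pi is preferred over Y' ~ pi_ref)."""
    idx = range(len(pi))
    return sum(pi[i] * pi_ref[j] * g[i][j] for i, j in product(idx, repeat=2))


def lemma1_rhs(pi, pi_ref, g):
    """Right-hand side of Lemma 1, computed as an exact expectation.

    Pairs (Y1, Y2) are enumerated with probability pi_ref[i] * pi_ref[j];
    E[Z | Y1=i, Y2=j] = g[i][j]; w is the assumed ratio pi / pi_ref.
    """
    w = [p / q for p, q in zip(pi, pi_ref)]
    idx = range(len(pi))
    total = 0.0
    for i, j in product(idx, repeat=2):
        pair_prob = pi_ref[i] * pi_ref[j]
        total += pair_prob * (w[i] * g[i][j] + w[j] * (1.0 - g[i][j]))
    return 0.5 * total


lhs = total_preference(pi, pi_ref, g)
rhs = lemma1_rhs(pi, pi_ref, g)
print(f"p*(pi) = {lhs:.6f}, Lemma 1 right-hand side = {rhs:.6f}")
```

Under these assumptions both terms inside the expectation equal $p^*(\pi)$ on their own, which is why the factor $\frac{1}{2}$ appears: the weight moves one of the two reference draws onto $\pi$, and the symmetry $g(i,j)+g(j,i)=1$ makes the two terms interchangeable.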

Figures (6)

  • Figure 1: A visualization of our proposed preference optimization algorithm. $\widehat{\pi}_{\textrm{ref}}$ denotes the specified reference policy whereas $\widehat{g}$ denotes the specified preference model. Our proposal is doubly robust in that it requires only one of the two, the reference policy or the preference model, to be correctly specified.
  • Figure 2: A visualization of our theoretical findings.
  • Figure 3: MSEs of different preference evaluation estimators on the IMDb dataset. Shaded areas visualize the 95% confidence bands.
  • Figure 4: Pairwise win rate matrices between different methods across two datasets. Left: TL;DR. Right: HH. Each entry indicates how often the row method outperforms the column method.
  • Figure 5: Pairwise win rates on the TL;DR dataset under different sampling temperatures (left: 0.75; right: 0.25).
  • ...and 1 more figure

Theorems & Definitions (9)

  • Lemma 1
  • Theorem 2: MSE
  • Corollary 3: Doubly robust evaluation
  • Corollary 4: Semi-parametric efficiency
  • Theorem 5: Performance gap
  • Corollary 6: Doubly robust optimization
  • Theorem 7: Suboptimality gap
  • Lemma 8
  • Proof of Lemma \ref{lem:EIF}