
Online Self-Preferring Language Models

Yuanzhao Zhai, Zhuo Zhang, Kele Xu, Hanyang Peng, Yue Yu, Dawei Feng, Cheng Yang, Bo Ding, Huaimin Wang

TL;DR

This work tackles the alignment of large language models with human preferences by explicitly modeling preference strength, rather than relying solely on binary labels or substituted rewards. It introduces Online Self-Preferring (OSP) language models, which sample self-generated responses on-policy and learn from self-judged soft preference strengths through a soft-preference cross-entropy (SPCE) loss. Empirical results on the HH and TL;DR datasets show state-of-the-art alignment, strong sample efficiency, and robust generalization to out-of-domain tasks, including a 95% win rate against the base model with only 200 prompts. OSP achieves these benefits with parameter efficiency and without external reward models, offering a promising path for self-improvement and more reliable online alignment. Noted limitations include higher computation and potential bias in the LLM-as-a-judge.

Abstract

Aligning with human preference datasets has been critical to the success of large language models (LLMs). Reinforcement learning from human feedback (RLHF) employs a costly reward model to provide feedback on responses sampled on-policy. Recently, offline methods that directly fit responses with binary preferences in the dataset have emerged as alternatives. However, existing methods do not explicitly model preference strength information, which is crucial for distinguishing different response pairs. To overcome this limitation, we propose Online Self-Preferring (OSP) language models, which learn from self-generated response pairs and self-judged preference strengths. For each prompt and its corresponding self-generated responses, we introduce a ranked pairing method to construct multiple response pairs with preference strength information. We then propose the soft-preference cross-entropy loss to leverage such information. Empirically, we demonstrate that leveraging preference strength is crucial for avoiding overfitting and enhancing alignment performance. OSP achieves state-of-the-art alignment performance across various metrics on two widely used human preference datasets. OSP is parameter-efficient and more robust than the dominant online method, RLHF, both when limited offline data are available and when generalizing to out-of-domain tasks. Moreover, OSP language models built on LLMs that are proficient at self-preferring can efficiently self-improve without external supervision.

Paper Structure (34 sections, 12 equations, 9 figures, 10 tables, 1 algorithm)

Figures (9)

  • Figure 1: Comparison of LLM alignment methods. Compared to RLHF, OSP uses the LLM itself to provide the preference strength of response pairs, instead of using a separate reward model to score single responses. In contrast to offline methods, OSP can effectively learn from on-the-fly self-generated samples and their associated preference strengths.
  • Figure 2: Illustration of the alignment pipeline of OSP. For each training prompt ${\boldsymbol{x}}$, OSP first employs the current model $\pi_\theta$ to sample $K$ candidate responses, constructing $\lfloor K/2 \rfloor$ response pairs in a ranked pairing manner. $\pi_\text{SFT}$ subsequently judges the response pairs to obtain their preference strengths. Finally, our proposed SPCE loss leverages multiple response pairs with different preference strengths to align the LLM $\pi_\theta$, where $\pi_\text{SFT}$ is also used for normalization.
  • Figure 3: Head-to-head comparison on the HH (left) and TL;DR (right) datasets, where win rates (%) are evaluated by GPT-4. Each win rate is the percentage of samples for which the method on the vertical axis outperforms the method on the horizontal axis.
  • Figure 4: (a, b) Alignment of LLMs built on TinyLlama on OOD generalization tasks. (c) Alignment of LLMs built on Mistral-7B-Instruct-v0.2 without using a human preference dataset. All curves are averaged across 4 seeds, and the shaded area indicates the standard deviation.
  • Figure 5: Ablations of OSP language models built on TinyLlama and trained on $2\%$ of the HH dataset with different loss functions, response sampling, and pair construction methods.
  • ...and 4 more figures
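
The pipeline described in the Figure 2 caption (rank the sampled responses, pair them, then train on soft preference strengths) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the function names `ranked_pairs` and `spce_loss` are hypothetical, the pairing scheme (i-th best against i-th worst) is only one plausible reading of "ranked pairing", the DPO-style log-ratio against the reference model is an assumed form of the normalization by $\pi_\text{SFT}$, and `beta` is a hypothetical temperature.

```python
import math
from typing import List, Tuple


def ranked_pairs(scores: List[float]) -> List[Tuple[int, int]]:
    """Build K // 2 (preferred, dispreferred) index pairs from K scored responses.

    Assumed scheme: rank by score, then pair the i-th best with the i-th worst.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    half = len(order) // 2  # the middle response is dropped when K is odd
    return [(order[i], order[-1 - i]) for i in range(half)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def spce_loss(
    strength: float,    # soft label in [0, 1]: judged probability that A beats B
    logp_a: float,      # log-probability of response A under the current policy
    logp_b: float,      # log-probability of response B under the current policy
    ref_logp_a: float,  # log-probability of A under the reference (SFT) model
    ref_logp_b: float,  # log-probability of B under the reference (SFT) model
    beta: float = 0.1,  # hypothetical temperature
) -> float:
    """Cross-entropy between a soft preference label and the policy's implied preference."""
    # Assumed form: preference logit from log-ratios normalized by the reference model.
    logit = beta * ((logp_a - ref_logp_a) - (logp_b - ref_logp_b))
    q = sigmoid(logit)
    eps = 1e-12  # guards against log(0)
    return -(strength * math.log(q + eps) + (1.0 - strength) * math.log(1.0 - q + eps))


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.5, 0.3]  # hypothetical judge scores for K = 4 responses
    print(ranked_pairs(scores))
    print(spce_loss(0.5, 0.0, 0.0, 0.0, 0.0))
```

With a soft label, the cross-entropy is minimized when the policy's implied preference equals the judged strength, so a weakly preferred pair does not push the logit toward infinity the way a hard binary label would. This is one way to read the abstract's claim that preference strength helps avoid overfitting.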