
Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model

Wenhong Zhu, Zhiwei He, Xiaofeng Wang, Pengfei Liu, Rui Wang

TL;DR

The paper proposes Weak-to-Strong Preference Optimization (WSPO), which aligns a strong model by learning the distribution differences before and after the alignment of a weak model. The results suggest that using a weak model to elicit high alignment ability in a strong model is feasible.

Abstract

Aligning language models (LMs) with human preferences has become a key area of research, enabling these models to meet diverse user needs better. Inspired by weak-to-strong generalization, where a strong LM fine-tuned on labels generated by a weaker model can consistently outperform its weak supervisor, we extend this idea to model alignment. In this work, we observe that the alignment behavior in weaker models can be effectively transferred to stronger models and even exhibit an amplification effect. Based on this insight, we propose a method called Weak-to-Strong Preference Optimization (WSPO), which achieves strong model alignment by learning the distribution differences before and after the alignment of the weak model. Experiments demonstrate that WSPO delivers outstanding performance, improving the win rate of Qwen2-7B-Instruct on Arena-Hard from 39.70 to 49.60, achieving a remarkable 47.04 length-controlled win rate on AlpacaEval 2, and scoring 7.33 on MT-bench. Our results suggest that using the weak model to elicit a strong model with a high alignment ability is feasible.
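
To make "learning the distribution differences before and after the alignment of the weak model" concrete, here is a minimal sketch of one possible reading, not the paper's actual objective. It assumes the difference is measured as a per-response log-probability ratio and that the strong model is trained to reproduce the weak model's shift under a squared-error penalty. The function names, the `scale` parameter, and the squared-error form are all hypothetical choices made for illustration.

```python
import math


def log_ratio(logp_after, logp_before):
    """Return log(pi_after(y|x) / pi_before(y|x)) from two sequence log-probs."""
    return logp_after - logp_before


def matching_loss(strong_logp, strong_ref_logp,
                  weak_aligned_logp, weak_ref_logp, scale=1.0):
    """Illustrative loss (an assumption, not taken from the paper).

    Penalizes the squared gap between the strong model's shift away from
    its reference and the weak model's shift caused by alignment.
    `scale` is a hypothetical weighting knob.
    """
    strong_shift = log_ratio(strong_logp, strong_ref_logp)
    weak_shift = log_ratio(weak_aligned_logp, weak_ref_logp)
    return (scale * strong_shift - weak_shift) ** 2


# Toy numbers: alignment doubled the weak model's probability of a response.
weak_ref = math.log(0.2)
weak_aligned = math.log(0.4)

# A strong model that has not moved from its reference incurs a nonzero loss.
loss_unmoved = matching_loss(math.log(0.3), math.log(0.3), weak_aligned, weak_ref)

# A strong model that also doubled the probability matches the weak shift.
loss_matched = matching_loss(math.log(0.6), math.log(0.3), weak_aligned, weak_ref)
```

In this toy setup the loss reaches zero once the strong model's log-ratio equals the weak model's, so only the two weak checkpoints and the strong reference are needed as fixed inputs. Whether the paper uses this exact form is not stated in the text above.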

Paper Structure

This paper contains 96 sections, 4 theorems, 17 equations, 8 figures, 18 tables.

Key Result

Theorem 1

Under mild assumptions, all reward classes consistent with the Plackett-Luce (and Bradley-Terry in particular) models can be represented with the reparameterization $r(x, y)=\beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ for some model $\pi(y \mid x)$ and a given reference model $\pi_{

Figures (8)

  • Figure 1: Pipeline for LM alignment. (1) Perform SFT on the pre-trained model using expert data. (2) Current approaches incorporate explicit or implicit reward mechanisms to fine-tune the model further, aligning its behavior with human preferences. (3) WSPO aligns strong models by utilizing the distributional differences observed before and after aligning the weak model.
  • Figure 2: Left. PPO and WSPO alignment methods vary in the length of generated sequences compared to the reference SFT model using greedy decoding. Right. PPO and WSPO alignment methods show variation in reward hits compared to the reference SFT model, using the top-$p$ sampling algorithm at different temperatures.
  • Figure 3: Left. Win rates computed by GPT-4o-mini for Anthropic-HH single-step dialogue at different temperatures. Right. The win rates for different sampling temperatures remain relatively stable throughout the training process. WSPO demonstrates consistent performance across varying sampling temperatures over time.
  • Figure 4: Left. The effect of weak model size on the sequence length generated by WSPO compared to PPO using greedy decoding. Right. The impact of different $\gamma$ hyperparameters on WSPO in a single-turn dialogue analysis.
  • Figure 5: Left. Reward variation during PPO training of Qwen2-1.5B. Right. Loss variation during WSPO training of Qwen2-7B.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Proposition 1