Simplify RLHF as Reward-Weighted SFT: A Variational Method

Yuhao Du, Zhuo Li, Pengyu Cheng, Zhihong Chen, Yuejiao Xie, Xiang Wan, Anningzhe Gao

TL;DR

This paper tackles the instability and heavy computation in RLHF by recasting policy alignment as variational inference. It derives a reward-weighted, positive-measure objective via KL divergence minimization, yielding a stable, clipping-free SFT loss with weights proportional to $\pi_{\text{ref}}(\boldsymbol{y}|\boldsymbol{x}) \exp(r(\boldsymbol{x},\boldsymbol{y})/\lambda)$ and an in-batch estimator for the normalization term $Z(\boldsymbol{x})$. The approach, VAR, demonstrates improved stability and competitive alignment performance on HHA and generative benchmarks across multiple model scales, often matching or exceeding Direct Preference Optimization and offline RLHF baselines. The key practical contribution is a scalable, offline, single-SFT-like step that preserves generation diversity while improving helpfulness and harmlessness, enabling more robust LLM alignment in real-world settings. The work suggests promising directions for online variants and broader task coverage, with careful attention to reward-model biases and ethical deployment implications.
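The weighting described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: it assumes the batch holds candidate responses to a single prompt, that sequence log-probabilities under $\pi_{\text{ref}}$ and $\pi_\theta$ are already computed, and that the in-batch estimate of $Z(\boldsymbol{x})$ is simply the sum of the unnormalized weights over the batch. The function names `in_batch_weights` and `weighted_sft_loss` are hypothetical.

```python
import math


def in_batch_weights(ref_logprobs, rewards, lam):
    """Weights proportional to pi_ref(y|x) * exp(r(x, y) / lam).

    Assumption: Z(x) is estimated by summing the unnormalized weights
    over the batch, which makes the weights a softmax of the scores.
    """
    scores = [lp + r / lam for lp, r in zip(ref_logprobs, rewards)]
    top = max(scores)  # subtract the max before exponentiating, for stability
    unnormalized = [math.exp(s - top) for s in scores]
    z_hat = sum(unnormalized)  # in-batch stand-in for Z(x)
    return [u / z_hat for u in unnormalized]


def weighted_sft_loss(policy_logprobs, weights):
    """Reward-weighted negative log-likelihood: -sum_i w_i * log pi_theta(y_i|x)."""
    return -sum(w * lp for w, lp in zip(weights, policy_logprobs))


# Toy batch: three responses with equal reference log-probability.
ref_logprobs = [-2.0, -2.0, -2.0]
rewards = [1.0, 0.0, -1.0]
policy_logprobs = [-1.0, -2.0, -3.0]

weights = in_batch_weights(ref_logprobs, rewards, lam=1.0)
loss = weighted_sft_loss(policy_logprobs, weights)
print(weights, loss)
```

Because every weight is strictly positive, each term of the loss is an ordinary (scaled) SFT term, which is consistent with the clipping-free claim. In an autodiff framework the weights would be treated as constants so that gradients flow only through the policy log-probabilities.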

Abstract

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has long been hampered by its implementation complexity and computational cost. Even with recent simplifications, such as Direct Preference Optimization (DPO) and Advantage Leftover Lunch (A-LoL), over-fitting and training instability still keep the alignment process from reaching its expected optimal performance. To address these challenges, we propose a novel simplification of RLHF from the perspective of variational inference, called $\textbf{V}$ariational $\textbf{A}$lignment with $\textbf{R}$e-weighting ($\textbf{VAR}$). More specifically, by directly minimizing the distribution gap between the learned LLM policy and the optimal solution of RLHF, we transform the alignment objective into a reward-driven, re-weighted supervised fine-tuning (SFT) form, which requires only a minor adjustment to the SFT loss to obtain a noticeable improvement in training stability and effectiveness. On comprehensive alignment and generation benchmarks, our VAR method achieves competitive helpfulness and harmlessness.
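The step from "minimizing the distribution gap" to a re-weighted SFT loss can be sketched as follows. This is a sketch under assumptions rather than the paper's exact derivation: it uses the well-known closed-form optimum of the KL-regularized RLHF objective with temperature $\lambda$, and it assumes the gap is measured by the forward KL divergence from that optimum to the policy.

```latex
\begin{aligned}
\pi^{*}(\boldsymbol{y}|\boldsymbol{x})
  &= \frac{\pi_{\text{ref}}(\boldsymbol{y}|\boldsymbol{x})\,
           \exp\!\big(r(\boldsymbol{x},\boldsymbol{y})/\lambda\big)}{Z(\boldsymbol{x})},
  \qquad
  Z(\boldsymbol{x}) = \sum_{\boldsymbol{y}} \pi_{\text{ref}}(\boldsymbol{y}|\boldsymbol{x})\,
           \exp\!\big(r(\boldsymbol{x},\boldsymbol{y})/\lambda\big), \\[4pt]
\mathrm{KL}\big(\pi^{*} \,\|\, \pi_\theta\big)
  &= \sum_{\boldsymbol{y}} \pi^{*}(\boldsymbol{y}|\boldsymbol{x}) \log \pi^{*}(\boldsymbol{y}|\boldsymbol{x})
   - \sum_{\boldsymbol{y}} \pi^{*}(\boldsymbol{y}|\boldsymbol{x}) \log \pi_\theta(\boldsymbol{y}|\boldsymbol{x}) \\
  &= \text{const}
   - \sum_{\boldsymbol{y}}
     \frac{\pi_{\text{ref}}(\boldsymbol{y}|\boldsymbol{x})\,
           \exp\!\big(r(\boldsymbol{x},\boldsymbol{y})/\lambda\big)}{Z(\boldsymbol{x})}
     \log \pi_\theta(\boldsymbol{y}|\boldsymbol{x}).
\end{aligned}
```

The first sum does not depend on $\theta$, so minimizing the divergence amounts to minimizing a negative log-likelihood in which each response carries a strictly positive weight proportional to $\pi_{\text{ref}}(\boldsymbol{y}|\boldsymbol{x}) \exp(r(\boldsymbol{x},\boldsymbol{y})/\lambda)$, which is an SFT loss up to re-weighting.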

Paper Structure

This paper contains 38 sections, 2 theorems, 38 equations, 4 figures, 6 tables, 1 algorithm.

Key Result

Theorem 2.1

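For any policy $\pi_\theta(\boldsymbol{y}|\boldsymbol{x})$ satisfying $\sum_{\boldsymbol{y}} \pi_\theta(\boldsymbol{y}|\boldsymbol{x})=1$ and weights $w(\boldsymbol{x},\boldsymbol{y})>0$, the weighted SFT loss satisfies: with equality if and only if $\pi_\theta(\boldsymbol{y}|\boldsymbol{x}) = \delta_{\boldsymbol{y}=\boldsymbol{y}^*}$ where $\delta_{\boldsymbol{y}=\boldsymbol{y}^*}$ is the optimal policy when $\boldsymbol{y}^* = \arg\max_{\boldsymbol{y}} w(\boldsymbol{x},\boldsymbol{y})$.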

Figures (4)

  • Figure 1: GPT-4 evaluation results on the HHA test set for the Llama series, reporting average win rates. Error bars are calculated across three different random seeds.
  • Figure 2: Average validation reward during the training process for (a) Llama3.2-1B, (b) Llama3.2-3B, and (c) Llama3.1-8B on the OffsetBias dataset, comparing DPO and our method.
  • Figure 3: Average output length for aligned Llama3.1-8B on the HHA test set.
  • Figure 4: GPT-4 evaluation results on the HHA test set for the Qwen series, reporting average win rates. Error bars are calculated across three different random seeds.

Theorems & Definitions (5)

  • Theorem 2.1
  • proof
  • Theorem 2.2
  • proof
  • proof