Direct Advantage Regression: Aligning LLMs with Online AI Reward

Li He, He Zhao, Stephen Wan, Dadong Wang, Lina Yao, Tongliang Liu

TL;DR

Direct Advantage Regression (DAR) is an RL-free online alignment method that uses online AI rewards to steer LLM policy improvement through a weighted supervised fine-tuning objective. DAR derives a dual-regularized, on-policy objective whose coefficients $(\alpha,\beta)$ balance regularization toward a static reference policy and toward the current sampling policy, yielding a closed-form optimal policy and a practical gradient $\nabla_\theta$ that combines an advantage-based weight with a regularization weight. Empirically, DAR outperforms OAIF and online RLHF across multiple datasets and models, requires fewer online annotations, and achieves higher human-AI agreement, while keeping learning stable through weight clipping and Monte Carlo estimates. The work shows that AI reward signals provide finer supervision than AI preferences and that the dual regularization framework enables robust, efficient alignment with state-of-the-art reward models. It also points to potential extensions to multi-modal alignment and to broader societal impact considerations.

Abstract

Online AI Feedback (OAIF) presents a promising alternative to Reinforcement Learning from Human Feedback (RLHF) by using online AI preference to align large language models (LLMs). However, simply replacing humans with AI deprives LLMs of more fine-grained AI supervision beyond binary signals. In this paper, we propose Direct Advantage Regression (DAR), a simple alignment algorithm that uses online AI reward to optimize policy improvement through weighted supervised fine-tuning. As an RL-free approach, DAR maintains theoretical consistency with online RLHF pipelines while significantly reducing implementation complexity and improving learning efficiency. Our empirical results show that AI reward is a better form of AI supervision than AI preference, consistently achieving higher human-AI agreement. Additionally, evaluations using GPT-4-Turbo and MT-bench show that DAR outperforms both OAIF and online RLHF baselines.

Paper Structure

This paper contains 42 sections, 1 theorem, 19 equations, 5 figures, 11 tables, 1 algorithm.

Key Result

Theorem 4.1

Under mild assumptions, given a dual-constrained advantage (or reward) maximization objective such as the DAR objective, with both KL coefficients $\alpha,\beta$ strictly positive, there exists a solution to the problem: $\pi^{*}(y|x)=\frac{1}{Z(x)}\,\pi_{\textnormal{ref}}(y|x)^{\frac{\alpha}{\alpha+\beta}}\pi_t(y|x)^{\frac{\beta}{\alpha+\beta}}\exp\left(\frac{A(x,y)}{\alpha+\beta}\right)$, where $Z(x)=\sum\limits_{y} \pi_{\textnormal{ref}}(y|x)^{\frac{\alpha}{\alpha+\beta}}\pi_t(y|x)^{\frac{\beta}{\alpha+\beta}}\exp\left(\frac{A(x,y)}{\alpha+\beta}\right)$ is the partition function.
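The closed form implied by the partition function $Z(x)$ in Theorem 4.1 can be checked numerically on a toy problem. The sketch below is not from the paper: it assumes the objective has the form $\mathbb{E}_{y\sim\pi}[A(x,y)]-\alpha\,\mathrm{KL}(\pi\,\|\,\pi_{\textnormal{ref}})-\beta\,\mathrm{KL}(\pi\,\|\,\pi_t)$ for a single prompt with three candidate responses, and all function names and numbers are hypothetical. Under that assumption, the normalized policy should score at least as high as any other distribution, and its objective value should equal $(\alpha+\beta)\log Z(x)$.

```python
import math
import random

def unnormalized(pi_ref, pi_t, adv, alpha, beta):
    """Per-response terms whose sum is the partition function Z(x)."""
    s = alpha + beta
    return [
        (r ** (alpha / s)) * (t ** (beta / s)) * math.exp(a / s)
        for r, t, a in zip(pi_ref, pi_t, adv)
    ]

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(pi, pi_ref, pi_t, adv, alpha, beta):
    """Assumed form: E_pi[A] - alpha*KL(pi||pi_ref) - beta*KL(pi||pi_t)."""
    expected_adv = sum(p * a for p, a in zip(pi, adv))
    return expected_adv - alpha * kl(pi, pi_ref) - beta * kl(pi, pi_t)

# Hypothetical toy values for one prompt with three responses.
pi_ref = [0.5, 0.3, 0.2]   # static reference policy
pi_t = [0.4, 0.4, 0.2]     # current sampling policy
adv = [1.0, -0.5, 0.2]     # advantages A(x, y)
alpha, beta = 0.3, 0.7

terms = unnormalized(pi_ref, pi_t, adv, alpha, beta)
z = sum(terms)                      # partition function Z(x)
pi_star = [u / z for u in terms]    # candidate optimal policy
best = objective(pi_star, pi_ref, pi_t, adv, alpha, beta)

# Compare against random distributions on the simplex.
rng = random.Random(0)

def random_policy(n):
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [v / total for v in raw]

gaps = [
    best - objective(random_policy(3), pi_ref, pi_t, adv, alpha, beta)
    for _ in range(1000)
]
print("smallest gap to a random policy:", min(gaps))
print("objective at pi_star vs (alpha+beta)*log Z:", best, (alpha + beta) * math.log(z))
```

Random search is only a sanity check, not a proof. The identity it probes is that the assumed objective equals $(\alpha+\beta)\left[\log Z(x)-\mathrm{KL}(\pi\,\|\,\pi^{*})\right]$, which is maximized exactly when $\pi=\pi^{*}$.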

Figures (5)

  • Figure 1: Direct Advantage Regression with Online AI Reward. (Left) Using the reward labels provided by the LLM annotator, DAR increases the likelihood of each of the n-shot responses according to its calculated regression weight, so that higher-quality responses have a higher probability of being sampled in the next iteration. (Right) The dual-constraint optimization objective of DAR: 1) the reference regularization prevents reward over-optimization, 2) the current sampling regularization ensures stable gradient updates in each iteration.
  • Figure 2: Contrasting KL regularization approaches in RL fine-tuning and on-policy RL. (Left) RL fine-tuning employs a fixed reference policy to mitigate reward hacking. (Right) On-policy RL methods (including regression-based) regularize with respect to the current sampling policy to ensure monotonic policy improvement.
  • Figure 3: Reference win rate curves of DAR with online AI reward against DPO with offline human preference, DAP methods (DPO, IPO, SLiC) with online AI preference, and RLHF methods (PPO, RLOO, Iterative SFT) with online AI reward. Win rates are averaged over 3 seeds and are judged by GPT-4-Turbo based on a 1k random test set for the tasks of TL;DR, Helpfulness and Harmlessness.
  • Figure 4: Performance of DAR on Helpfulness under different total regularization (a), and alpha ratio (b). Win rates are judged by Qwen2-72B-Instruct using a 1k random test set, while results are averaged over 3 seeds.
  • Figure 5: Reference win rate (a) and AI reward (b) of DAR on Helpfulness when varying weight clip and sampling size. Win rates and rewards are judged by Qwen2-72B-Instruct using a 1k random test set.
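The regression weight, weight clip, and sampling size mentioned in the captions above can be illustrated with a small sketch. It is an assumption-laden reading rather than the paper's implementation: the weight is taken as the ratio of the Theorem 4.1 closed form to the sampling policy $\pi_t$ (dropping the partition function), the advantage is estimated as each reward minus the mean reward over the sampled responses, and the function names, clip value, and numbers are all hypothetical.

```python
import math

def dar_style_weights(rewards, logp_ref, logp_t, alpha, beta, clip=10.0):
    """Clipped per-response weights for responses sampled from pi_t.

    Assumed form (Z(x) dropped):
      w = (pi_ref / pi_t)^(alpha/(alpha+beta)) * exp(A / (alpha+beta))
    """
    s = alpha + beta
    # Assumed Monte Carlo baseline: mean reward over the sampled responses.
    baseline = sum(rewards) / len(rewards)
    weights = []
    for r, lp_ref, lp_t in zip(rewards, logp_ref, logp_t):
        advantage = r - baseline
        reg_weight = math.exp((alpha / s) * (lp_ref - lp_t))  # regularization part
        adv_weight = math.exp(advantage / s)                  # advantage part
        weights.append(min(reg_weight * adv_weight, clip))    # clip for stability
    return weights

def weighted_sft_loss(weights, logp_theta):
    """Weighted negative log-likelihood averaged over the sampled responses."""
    return -sum(w * lp for w, lp in zip(weights, logp_theta)) / len(weights)

# Hypothetical numbers: three sampled responses for one prompt.
rewards = [8.0, 5.0, 2.0]          # AI reward scores
logp_ref = [-12.0, -10.0, -11.0]   # sequence log-probs under the reference policy
logp_t = [-11.5, -10.0, -11.5]     # sequence log-probs under the sampling policy
weights = dar_style_weights(rewards, logp_ref, logp_t, alpha=0.5, beta=0.5)
loss = weighted_sft_loss(weights, logp_theta=logp_t)
print(weights, loss)
```

In this toy case the highest-reward response hits the clip, which shows why a cap matters: the exponential of a large advantage would otherwise dominate the batch. In a real training loop the weights would be treated as constants (no gradient) and `logp_theta` would come from the model being trained.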

Theorems & Definitions (1)

  • Theorem 4.1