Direct Advantage Regression: Aligning LLMs with Online AI Reward
Li He, He Zhao, Stephen Wan, Dadong Wang, Lina Yao, Tongliang Liu
TL;DR
Direct Advantage Regression (DAR) is an RL-free online alignment method that uses online AI rewards to steer LLM policy improvement through a weighted supervised fine-tuning objective. DAR derives a dual-regularized, on-policy objective whose coefficients $(\alpha,\beta)$ balance regularization toward a static reference policy against regularization toward the current sampling policy. This yields a closed-form optimal policy and a practical gradient with respect to the policy parameters $\theta$ that combines an advantage-based weight with a regularization weight. Empirically, DAR outperforms OAIF and online RLHF across multiple datasets and models, requires fewer online annotations, and achieves higher human-AI agreement, while weight clipping and Monte Carlo estimates keep learning stable. The work shows that AI reward signals provide finer-grained supervision than AI preferences and that the dual-regularization framework enables robust, efficient alignment with state-of-the-art reward models. The authors also note potential extensions to multi-modal alignment and discuss broader societal impact.
Abstract
Online AI Feedback (OAIF) presents a promising alternative to Reinforcement Learning from Human Feedback (RLHF) by using online AI preference to align large language models (LLMs). However, simply replacing humans with AI deprives LLMs of more fine-grained AI supervision beyond binary signals. In this paper, we propose Direct Advantage Regression (DAR), a simple alignment algorithm that uses online AI reward to optimize policy improvement through weighted supervised fine-tuning. As an RL-free approach, DAR maintains theoretical consistency with online RLHF pipelines while significantly reducing implementation complexity and improving learning efficiency. Our empirical results underscore that AI reward is a better form of AI supervision than AI preference, consistently achieving higher human-AI agreement. Additionally, evaluations using GPT-4-Turbo and MT-bench show that DAR outperforms both OAIF and online RLHF baselines.
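
To make the weighted supervised fine-tuning idea concrete, the following is a minimal sketch of how per-response weights could be computed for one prompt. It is not the authors' implementation. It assumes the weight takes the form implied by the standard closed-form solution of an objective with two KL penalties, namely $\exp(A/(\alpha+\beta))$ multiplied by $(\pi_{\mathrm{ref}}/\pi_t)^{\alpha/(\alpha+\beta)}$, and it assumes the advantage $A$ is estimated by subtracting the mean reward of the sampled responses. The function names, default values, and the clipping threshold `clip_max` are hypothetical; the paper's exact estimator and clipping scheme may differ.

```python
import numpy as np


def dar_weights(rewards, logp_ref, logp_cur, alpha=0.5, beta=0.5, clip_max=20.0):
    """Illustrative per-response weights for one prompt (assumed form).

    rewards  : AI reward for each of K responses sampled from the current policy
    logp_ref : log-probability of each response under the static reference policy
    logp_cur : log-probability of each response under the current sampling policy
    """
    rewards = np.asarray(rewards, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    logp_cur = np.asarray(logp_cur, dtype=float)

    # Assumed Monte Carlo advantage estimate: reward minus the sample mean.
    advantages = rewards - rewards.mean()

    scale = alpha + beta
    # Advantage-based weight: higher-reward responses get more mass.
    adv_weight = np.exp(advantages / scale)
    # Regularization weight: (pi_ref / pi_cur) ** (alpha / (alpha + beta)).
    reg_weight = np.exp((alpha / scale) * (logp_ref - logp_cur))

    # Clip from above so a single response cannot dominate the update.
    return np.minimum(adv_weight * reg_weight, clip_max)


def weighted_sft_loss(weights, logp_theta):
    """Weighted negative log-likelihood; weights are treated as constants."""
    weights = np.asarray(weights, dtype=float)
    logp_theta = np.asarray(logp_theta, dtype=float)
    return -float(np.mean(weights * logp_theta))


if __name__ == "__main__":
    # Two sampled responses; the first receives the higher AI reward.
    w = dar_weights(rewards=[1.0, 0.0], logp_ref=[-5.0, -5.0], logp_cur=[-5.0, -5.0])
    print("weights:", w)
    print("loss:", weighted_sft_loss(w, logp_theta=[-5.0, -5.0]))
```

In this sketch a larger $\alpha$ relative to $\beta$ gives the reference-policy ratio more influence on the weight, while a larger sum $\alpha+\beta$ flattens the advantage term, so the update stays closer to plain supervised fine-tuning on the sampled responses.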
