SAIL: Self-Improving Efficient Online Alignment of Large Language Models

Mucong Ding; Souradip Chakraborty; Vibhu Agrawal; Zora Che; Alec Koppel; Mengdi Wang; Amrit Bedi; Furong Huang

TL;DR

This work reframes online RLHF for large language models as a bilevel optimization problem that couples reward learning with policy updates, addressing distribution shift via a unified framework. By leveraging reward-policy equivalence, the authors convert the bilevel objective into a tractable single-level Direct Preference Optimization with provable efficiency, while also relaxing the need for constant preference oracle access through self-improvement with offline data. They introduce three SAIL designs (DDP, DPP, DPR) that blend online and offline data to improve alignment with minimal computational overhead, and demonstrate superior performance over traditional DPO on multiple open datasets and modern LLMs. The approach yields stronger win-rates, better offline reward evaluation, and competitive MT-Bench scores, highlighting its practical impact for scalable online LLM alignment.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is a key method for aligning large language models (LLMs) with human preferences. However, current offline alignment approaches like DPO, IPO, and SLiC rely heavily on fixed preference datasets, which can lead to sub-optimal performance. On the other hand, recent literature has focused on designing online RLHF methods but still lacks a unified conceptual formulation and suffers from distribution shift issues. To address this, we establish that online LLM alignment is underpinned by bilevel optimization. By reducing this formulation to an efficient single-level first-order method (using the reward-policy equivalence), our approach generates new samples and iteratively refines model alignment by exploring responses and regulating preference labels. In doing so, we permit alignment methods to operate in an online and self-improving manner, as well as generalize prior online RLHF methods as special cases. Compared to state-of-the-art iterative RLHF methods, our approach significantly improves alignment performance on open-sourced datasets with minimal computational overhead.
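
The reduction described in the abstract rests on the reward-policy equivalence familiar from DPO, where the implicit reward of a response is proportional to the log-ratio between the policy and a reference model. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair. It is not the SAIL objective itself, whose exact form is not given in this summary. The function name, argument names, and the default value of beta are assumptions made for this example.

```python
import math


def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All inputs are sequence log-probabilities (floats). `beta` is the
    KL-regularization strength; 0.1 is an illustrative default, not a
    value taken from the paper.
    """
    # Implicit rewards via the reward-policy equivalence:
    # r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x)), up to a constant.
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected

    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's log-ratio relative to the rejected one.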

Paper Structure (15 sections, 13 equations, 6 figures, 3 tables)

Introduction
Related Works
Problem Formulation
Existing Online RLHF Framework in the context of LLMs
Issue of Distribution shift in Iterative Online RLHF
Proposed approach: Efficient Bilevel Direct Preference Optimization
Relaxing the Preference Oracle Assumption: Toward Self-improving LLMs
Experiments
Comparing SAIL Designs
SAIL Applied to State-of-the-art LLM Alignment
Conclusions
Experiment Implementation Details
Additional Experiment Details
Prompt Templates
Broader Impacts

Figures (6)

Figure 1: Left: The standard three-step procedure of RLHF, which includes Step 0: supervised fine-tuning, Step 1: reward learning, and Step 2: policy alignment via fine-tuning. The dotted line indicates the entanglement between the reward learning and policy tuning steps, which is the key part of online RLHF. In offline RLHF, this entanglement is usually ignored, leading to suboptimal solutions. Right: A teaser of the benefits of our approach in comparison to the state of the art.
Figure 2: Possible compositions of the mixture distribution. Each distribution is characterized by the source of prompt, responses, and preferences, and is represented as a path in the diagram.
Figure 3: Relative performance and efficiency of the 3 SAIL designs compared to DPO. Higher is better; see the Experiments section and the sweep summary table for details.
Figure 4: Sweeping shows a favorable range of mixture weight and gradient coefficient combinations.
Figure 5: DPP requires response generation and DPR additionally requires reward evaluation during training; both lead to a larger time overhead and a smaller "best dist. mixture weight" to strike a balance between performance and efficiency.
