Leftover Lunch: Advantage-based Offline Reinforcement Learning for Language Models

Ashutosh Baheti, Ximing Lu, Faeze Brahman, Ronan Le Bras, Maarten Sap, Mark Riedl

TL;DR

The paper tackles the data-hungry and unstable nature of RLHF by introducing Advantage-Leftover Lunch RL (A-LoL), an offline policy-gradient framework that treats each language generation as a single action and uses a fixed reference value estimate to compute a positive-advantage signal from pre-collected data. It provides several variants (including per-token importance and KL-regularized versions) and emphasizes positive-advantage priority sampling to filter noisy data, achieving strong performance across multiple reward settings. Across four language-generation tasks, A-LoL rivals or surpasses online PPO and other offline baselines in reward, safety, and diversity, while demonstrating data efficiency and stability. The work also demonstrates multi-reward optimization capabilities, robustness to suboptimal data, and practical code release, offering a scalable alternative to RLHF for LM alignment with reduced computational overhead.

Abstract

Reinforcement Learning with Human Feedback (RLHF) is the most prominent method for Language Model (LM) alignment. However, RLHF is an unstable and data-hungry process that continually requires new high-quality LM-generated data for finetuning. We introduce Advantage-Leftover Lunch RL (A-LoL), a new class of offline policy gradient algorithms that enable RL training on any pre-existing data. By assuming the entire LM output sequence as a single action, A-LoL allows incorporating sequence-level classifiers or human-designed scoring functions as rewards. Subsequently, by using LM's value estimate, A-LoL only trains on positive advantage (leftover) data points, making it resilient to noise. Overall, A-LoL is an easy-to-implement, sample-efficient, and stable LM training recipe. We demonstrate the effectiveness of A-LoL and its variants with a set of four different language generation tasks. We compare against both online RL (PPO) and recent preference-based (DPO, PRO) and reward-based (GOLD) offline RL baselines. On the commonly-used RLHF benchmark, Helpful and Harmless Assistant (HHA), LMs trained with A-LoL methods achieve the highest diversity while also being rated more safe and helpful than the baselines according to humans. Additionally, in the remaining three tasks, A-LoL could optimize multiple distinct reward functions even when using noisy or suboptimal training data. We also release our experimental code. https://github.com/abaheti95/LoL-RL
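The "positive advantage (leftover)" filtering described in the abstract can be illustrated with a minimal sketch. It assumes the advantage of a training example is its sequence-level reward minus the reference LM's value estimate for the prompt. The function name `positive_advantage_subset` and the toy numbers are hypothetical and are not taken from the released code.

```python
from typing import List, Tuple


def positive_advantage_subset(
    rewards: List[float], ref_values: List[float]
) -> List[Tuple[int, float]]:
    """Return (index, advantage) pairs for examples with positive advantage.

    Assumption: advantage = sequence-level reward - reference value estimate.
    """
    kept = []
    for i, (reward, value) in enumerate(zip(rewards, ref_values)):
        advantage = reward - value
        if advantage > 0:  # examples with zero or negative advantage are dropped
            kept.append((i, advantage))
    return kept


# Toy data: one reward and one reference value estimate per training example.
rewards = [0.9, 0.2, 0.5, 0.7]
ref_values = [0.6, 0.4, 0.5, 0.1]
kept = positive_advantage_subset(rewards, ref_values)
kept_indices = [i for i, _ in kept]
```

In this toy setting, only the examples whose reward exceeds the reference value estimate remain available for training, which is one way to read the abstract's claim of resilience to noisy data.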

Paper Structure

This paper contains 38 sections, 6 equations, 8 figures, 14 tables, 1 algorithm.

Figures (8)

  • Figure 1: Illustration of Advantage-Leftover Lunch RL in practice. We first finetune the reference policy ($\pi_{\text{ref}}$) on the training data with supervised learning as a precursor to A-LoL training. Then, an external reward model is employed to train the value estimate layer ($V_{\pi_{\text{ref}}}$) on top of the frozen $\pi_{\text{ref}}$. Subsequently, using the reference policy values on $D_{tr}$, we identify the instances with positive advantage. A-LoL then multiplies the positive advantage and importance weight with the negative log-likelihood to train the target LM ($\pi_{\theta}$). Evaluation on $D_{test}$ shows that the LM trained with A-LoL achieves a higher average reward and a better reward distribution than the reference policy.
  • Figure 2: HHA validation trends of preference (left), reward (middle), and advantage-based (right) offline RL algorithms compared with negative log-likelihood (NLL) training over three random seeds.
  • Figure 3: PPO validation reward for three random seeds when trained on the Helpful and Harmless Assistant task. After the initial improvement during the first 2000 steps, subsequent training shows very slow progress.
  • Figure 4: Test set distribution of rewards achieved by the reference policy and the top-performing offline RL methods.
  • Figure 5: Reddit Response Generation Task: Comparing the test reward distributions of A-LoL seq. and the reference policy for every scoring function. A-LoL seq. trained on downvoted comments almost matches the distribution of A-LoL seq. trained on upvoted comments on all scoring functions except the Probability of upvoting score.
  • ...and 3 more figures
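
The training signal described in the Figure 1 caption (positive advantage and importance weight multiplied with negative log-likelihood) can be sketched for a single example as follows. This is an illustrative reading of the caption, not the authors' implementation: the function name `alol_example_loss` and its arguments are hypothetical, the importance weight is assumed to be the sequence-level probability ratio between the target and reference policies, and details such as gradient handling or clipping of the weight are omitted.

```python
import math
from typing import Sequence


def alol_example_loss(
    advantage: float,
    target_token_log_probs: Sequence[float],
    ref_token_log_probs: Sequence[float],
) -> float:
    """Advantage- and importance-weighted NLL for one (prompt, output) pair.

    Assumptions: the whole output is a single action, so per-token log
    probabilities are summed into one sequence log-probability, and the
    importance weight is pi_theta(y|x) / pi_ref(y|x).
    """
    if advantage <= 0:
        return 0.0  # non-positive-advantage examples contribute nothing
    logp_target = sum(target_token_log_probs)
    logp_ref = sum(ref_token_log_probs)
    importance_weight = math.exp(logp_target - logp_ref)
    nll = -logp_target  # negative log-likelihood under the target LM
    return advantage * importance_weight * nll


# Toy example with two output tokens.
loss = alol_example_loss(0.5, [-0.1, -0.2], [-0.2, -0.3])
skipped = alol_example_loss(-0.2, [-0.1, -0.2], [-0.2, -0.3])
```

In an autograd framework the same product would be built from tensors. How the importance weight enters the gradient is a design choice the caption does not specify, so this sketch only computes the scalar value.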