Bridging Offline and Online Reinforcement Learning for LLMs

Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason E Weston, Sainbayar Sukhbaatar, Ilia Kulikov

TL;DR

The paper investigates how RL-based post-training for large language models performs as training shifts from offline to semi-online to online regimes, across both verifiable math problems and non-verifiable instruction tasks. It compares Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), highlighting that online and semi-online variants achieve similar convergence and substantially exceed offline performance, while multi-task reward signals improve results across task types. A key finding is that semi-online DPO often matches fully online performance while offering efficiency gains, suggesting it as a practical alternative for large-scale post-training. The work also shows that jointly optimizing verifiable and non-verifiable rewards yields robust improvements, enhancing cross-task generalization and informing scalable, multi-task RL strategies for LLM alignment.

Abstract

We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Relative Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, which all strongly outperform offline methods. We provide a detailed analysis of the training dynamics and hyperparameter selection strategies to achieve optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.

Paper Structure

This paper contains 24 sections, 9 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: (left): Visualization of a single training step within our training pipeline, which can be used for any training objective such as DPO or GRPO. Syncing the weights allows rollout responses to be generated from the most recent model. (right): Progression from offline to online training, showing when model weight synchronizations occur at different train steps. Offline training only syncs before training starts, whereas online training syncs at every step.
  • Figure 2: Without syncing the reference model, response lengths of online DPO collapse when trained on verifiable tasks (left). This length collapse is also correlated with lower validation reward (right).
  • Figure 3: Logit entropy collapse in iterative and online training on verifiable tasks. Although the average rollout length stays stable during training (right), the average entropy of the next-token distribution (left) decreases significantly in every training regime except the offline one.
  • Figure 4: LLM prompt used for verifiable task.
  • Figure 5: LLM prompt used for the non-verifiable task.
  • ...and 5 more figures
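
The synchronization schedule described in the Figure 1 caption (offline syncs once before training, online syncs at every step, and semi-online falls in between) can be illustrated with a small helper. This is only a minimal sketch: the function name `sync_steps` and the parameter `sync_interval` are hypothetical and do not come from the paper, and the assumption that semi-online training syncs at a fixed interval is made here purely for illustration.

```python
from typing import List, Optional


def sync_steps(total_steps: int, sync_interval: Optional[int]) -> List[int]:
    """Return the train steps before which rollout weights are synced.

    Hypothetical helper, not the paper's code:
      sync_interval=None -> offline: one sync before training starts (step 0)
      sync_interval=1    -> online: a sync at every step
      sync_interval=s>1  -> semi-online: a sync every s steps (assumed fixed)
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if sync_interval is None:
        # Offline: rollouts always come from the initial model.
        return [0]
    if sync_interval < 1:
        raise ValueError("sync_interval must be >= 1 or None")
    # Online (interval 1) and semi-online (interval > 1) share one rule.
    return list(range(0, total_steps, sync_interval))


if __name__ == "__main__":
    for name, interval in [("offline", None), ("semi-online", 3), ("online", 1)]:
        print(name, sync_steps(6, interval))
```

Under these assumptions, the three regimes differ only in how stale the rollout-generating model is allowed to become relative to the model being trained.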