Bridging Offline and Online Reinforcement Learning for LLMs

Jack Lanchantin; Angelica Chen; Janice Lan; Xian Li; Swarnadeep Saha; Tianlu Wang; Jing Xu; Ping Yu; Weizhe Yuan; Jason E Weston; Sainbayar Sukhbaatar; Ilia Kulikov

TL;DR

The paper investigates how RL-based post-training for large language models performs as training shifts from offline to semi-online to fully online regimes, on both verifiable math problems and non-verifiable instruction-following tasks. It compares Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) and finds that the online and semi-online variants reach similar performance and convergence, and that all of them substantially outperform offline training. A key finding is that semi-online DPO often matches fully online performance while offering efficiency gains, which makes it a practical alternative for large-scale post-training. The work also shows that jointly optimizing verifiable and non-verifiable rewards improves performance across both task types.

Abstract

We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Relative Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, which all strongly outperform offline methods. We provide a detailed analysis of the training dynamics and hyperparameter selection strategies to achieve optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.
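
The abstract names the three regimes and the two objectives without defining them. As a rough illustration only, the sketch below assumes one common reading: the model that generates responses is refreshed from the trainer every `sync_interval` steps, so an interval of 1 is fully online, a larger finite interval is semi-online, and an infinite interval is offline. It also shows the standard textbook forms of the per-pair DPO loss and the group-normalized advantage used by GRPO. All function and parameter names here (`sync_steps`, `sync_interval`, `dpo_loss`, `group_relative_advantages`, `beta`, `eps`) are hypothetical and are not taken from the paper.

```python
import math


def sync_steps(total_steps, sync_interval):
    """Steps at which the generator copy is refreshed from the trainer.

    Assumption: sync_interval=1 is online, a finite value > 1 is
    semi-online, and math.inf is offline (the generator is never refreshed).
    """
    if math.isinf(sync_interval):
        return []
    return [s for s in range(1, total_steps + 1) if s % sync_interval == 0]


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair of sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards normalized within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
```

Under these assumptions, `sync_steps(6, 3)` refreshes the generator at steps 3 and 6, while `sync_steps(6, math.inf)` never refreshes it. The actual schedules, objectives, and hyperparameters used in the paper may differ from this sketch.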
