RL-finetuning LLMs from on- and off-policy data with a single algorithm

Yunhao Tang; Taco Cohen; David W. Zhang; Michal Valko; Rémi Munos

TL;DR

This work introduces Any-Generation Reward Optimization (AGRO), a unified RL fine-tuning algorithm for LLMs that uses generation consistency to learn from both on-policy and off-policy data. It derives variance-based loss functions from the consistency condition and provides gradient decompositions with pathwise and likelihood-ratio components, with convergence guarantees toward the optimal policy $\pi^*$. The authors propose off-policy and on-policy AGRO variants, with token-level implementations and variance-reduction techniques, and report competitive gains on a mathematical reasoning benchmark (MATH) using an 8B Llama-3 model. They also compare against KL-regularized policy gradient, reporting better convergence properties and KL-efficiency for AGRO in off-policy settings, and discuss limitations and future work on stability and importance sampling for broader applicability.

Abstract

We introduce a novel reinforcement learning algorithm (AGRO, for Any-Generation Reward Optimization) for fine-tuning large language models. AGRO leverages the concept of generation consistency, which states that the optimal policy satisfies a consistency condition across any possible generation of the model. We derive algorithms that find optimal solutions via sample-based policy gradients and provide theoretical guarantees on their convergence. Our experiments demonstrate the effectiveness of AGRO in both on-policy and off-policy settings, showing improved performance over baseline algorithms on a mathematical reasoning dataset.
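The abstract does not spell out the consistency condition, so the following is only a minimal sketch under stated assumptions. It assumes the standard KL-regularized objective with coefficient `beta`, whose optimum is proportional to `pi_ref(y) * exp(r(y) / beta)`, and it takes a plain unweighted variance over a handful of candidate generations. The function names, the toy rewards, and the reference probabilities are hypothetical and are not taken from the paper, which may define and weight its loss differently.

```python
import math
from statistics import pvariance

def consistency_residuals(rewards, logp, logp_ref, beta):
    """Per-generation quantity r(y) - beta * log(pi(y) / pi_ref(y))."""
    return [r - beta * (lp - lr) for r, lp, lr in zip(rewards, logp, logp_ref)]

def variance_loss(rewards, logp, logp_ref, beta):
    """Unweighted variance of the residuals; zero when they are all equal."""
    return pvariance(consistency_residuals(rewards, logp, logp_ref, beta))

# Toy prompt with four candidate generations (all numbers are made up).
beta = 0.5
rewards = [1.0, 0.0, 0.5, -1.0]
logp_ref = [math.log(p) for p in (0.4, 0.3, 0.2, 0.1)]

# Closed-form optimum of the assumed KL-regularized objective:
# pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta).
logits = [lr + r / beta for lr, r in zip(logp_ref, rewards)]
log_z = math.log(sum(math.exp(v) for v in logits))
logp_star = [v - log_z for v in logits]

loss_at_optimum = variance_loss(rewards, logp_star, logp_ref, beta)
loss_at_reference = variance_loss(rewards, logp_ref, logp_ref, beta)
```

Under these assumptions the residual at the optimum equals the same constant for every generation, so the variance vanishes regardless of which generations are evaluated, whereas at the reference policy the residual reduces to the raw reward and the variance is strictly positive. That independence from the sampling distribution is what makes a loss of this form usable with both on-policy and off-policy data.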
