RL-finetuning LLMs from on- and off-policy data with a single algorithm
Yunhao Tang, Taco Cohen, David W. Zhang, Michal Valko, Rémi Munos
TL;DR
This work introduces Any-Generation Reward Optimization (AGRO), a unified RL fine-tuning algorithm for LLMs that uses generation consistency to learn from both on-policy and off-policy data. The authors derive variance-based loss functions from the consistency condition and provide gradient decompositions that include pathwise and likelihood-ratio components, with guarantees of convergence to the optimal policy $\pi^*$. They propose off-policy and on-policy AGRO variants, with token-level implementations and variance-reduction techniques, and report competitive gains on the MATH mathematical reasoning benchmark using an 8B Llama-3 model. A comparison against KL-regularized policy gradient shows that AGRO has better convergence properties and KL efficiency in off-policy settings. The authors also discuss limitations and future work on stability and importance sampling for broader applicability.
Abstract
We introduce a novel reinforcement learning algorithm (AGRO, for Any-Generation Reward Optimization) for fine-tuning large language models. AGRO leverages the concept of generation consistency, which states that the optimal policy satisfies a consistency condition across any possible generation of the model. We derive algorithms that find optimal solutions via sample-based policy gradients and provide theoretical guarantees on their convergence. Our experiments demonstrate the effectiveness of AGRO in both on-policy and off-policy settings, showing improved performance over baseline algorithms on a mathematical reasoning dataset.
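To make the idea of generation consistency concrete, here is a minimal numerical sketch on a toy problem with four possible generations. It assumes the consistency condition takes the familiar KL-regularized form, where the optimal policy satisfies $\pi^*(y) \propto \pi_{\text{ref}}(y)\exp(r(y)/\beta)$, so that $r(y) - \beta \log(\pi^*(y)/\pi_{\text{ref}}(y))$ is the same constant for every generation $y$. The abstract does not spell out this form, and the names `pi_ref`, `beta`, `mu`, and all functions below are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def optimal_policy(pi_ref, rewards, beta):
    """Closed-form optimum of the assumed KL-regularized objective."""
    logits = np.log(pi_ref) + rewards / beta
    logits = logits - logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def consistency_gap(pi, pi_ref, rewards, beta):
    """r(y) - beta * log(pi(y) / pi_ref(y)), one value per generation y."""
    return rewards - beta * (np.log(pi) - np.log(pi_ref))

def variance_loss(pi, pi_ref, rewards, beta, mu):
    """Variance of the consistency gap under a sampling distribution mu.

    mu can be the policy itself (on-policy) or any other distribution
    over generations (off-policy).
    """
    gap = consistency_gap(pi, pi_ref, rewards, beta)
    mean = np.sum(mu * gap)
    return np.sum(mu * (gap - mean) ** 2)

# Toy problem: four candidate generations (all values are made up).
pi_ref = np.array([0.1, 0.2, 0.3, 0.4])
rewards = np.array([1.0, 0.0, -1.0, 0.5])
beta = 0.5

pi_star = optimal_policy(pi_ref, rewards, beta)

# Two different sampling distributions over generations.
mu_uniform = np.full(4, 0.25)
mu_skewed = np.array([0.7, 0.1, 0.1, 0.1])

# At the optimum the gap is constant, so its variance vanishes under any mu.
loss_opt_uniform = variance_loss(pi_star, pi_ref, rewards, beta, mu_uniform)
loss_opt_skewed = variance_loss(pi_star, pi_ref, rewards, beta, mu_skewed)

# At the reference policy the gap equals the raw reward, so the loss is positive.
loss_ref_uniform = variance_loss(pi_ref, pi_ref, rewards, beta, mu_uniform)
```

In this sketch the loss is numerically zero at `pi_star` for both sampling distributions and strictly positive at `pi_ref`, which is the property that would let a variance-based objective be minimized with data from any generation source.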
