Reflect-RL: Two-Player Online RL Fine-Tuning for LMs

Runlong Zhou; Simon S. Du; Beibin Li

TL;DR

Reflect-RL is a two-player online RL fine-tuning framework for language models in multi-turn interactive environments, in which a frozen reflection model assists a trainable policy LM in making decisions. The method combines an SFT warm-up with online RL fine-tuning and relies on reflection-based reasoning, negative example generation, single-prompt action enumeration, and curriculum learning. On the new AutoExplore benchmark and related tasks (DangerousTaxi, ALFWorld), Reflect-RL outperforms both SFT and online RL without reflection, and GPT-2 XL 1.56B fine-tuned with Reflect-RL outperforms larger open-source LMs such as Mistral 7B. The approach offers a path toward efficient online RL for LMs in complex, interactive domains, with implications for responsible AI and future multi-agent extensions.

Abstract

As language models (LMs) demonstrate their capabilities in various fields, their application to tasks requiring multi-round interactions has become increasingly popular. These tasks usually have complex dynamics, so supervised fine-tuning (SFT) on a limited offline dataset does not yield good performance. However, only a few works have attempted to train LMs directly within interactive decision-making environments. We aim to create an effective approach to fine-tune LMs with online reinforcement learning (RL) in these environments. We propose Reflect-RL, a two-player system that fine-tunes an LM using SFT and online RL, in which a frozen reflection model (one player) assists the policy model (the other player). To generate data for the warm-up SFT stage, we use negative example generation to enhance the error-correction ability of the reflection model. Furthermore, we design single-prompt action enumeration and apply curriculum learning so that the policy model learns more efficiently. Empirically, we verify that Reflect-RL outperforms SFT and online RL without reflection. Test results indicate that GPT-2 XL 1.56B fine-tuned with Reflect-RL outperforms larger open-source LMs, such as Mistral 7B. The benchmarks, dataset, and code involved in this work are publicly available: https://github.com/zhourunlong/Reflect-RL.
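The two-player interaction described in the abstract can be illustrated with a toy rollout loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the environment `ToyLineEnv`, the functions `frozen_reflection`, `build_prompt`, `toy_policy`, and `run_episode`, and the prompt layout are all hypothetical stand-ins. In particular, reading "single-prompt action enumeration" as "list every valid action in one prompt and let the policy pick an index" is an assumption that the abstract does not confirm.

```python
import random
from typing import List, Tuple


class ToyLineEnv:
    """Hypothetical environment: walk from position 0 to the last cell."""

    def __init__(self, size: int = 5) -> None:
        self.size = size
        self.pos = 0

    def _observe(self) -> str:
        return f"position={self.pos} goal={self.size - 1}"

    def reset(self) -> str:
        self.pos = 0
        return self._observe()

    def valid_actions(self) -> List[str]:
        return ["left", "right"]

    def step(self, action: str) -> Tuple[str, float, bool]:
        if action == "right":
            self.pos = min(self.pos + 1, self.size - 1)
        else:
            self.pos = max(self.pos - 1, 0)
        done = self.pos == self.size - 1
        return self._observe(), (1.0 if done else 0.0), done


def frozen_reflection(observation: str) -> str:
    """Stand-in for the frozen reflection model: it is never updated."""
    pos, goal = (int(part.split("=")[1]) for part in observation.split())
    return "The goal is to the right." if pos < goal else "Already at the goal."


def build_prompt(observation: str, reflection: str, actions: List[str]) -> str:
    """Assumed prompt layout: all valid actions enumerated in one prompt."""
    options = "\n".join(f"{i}: {a}" for i, a in enumerate(actions))
    return (
        f"Observation: {observation}\n"
        f"Reflection: {reflection}\n"
        f"Actions:\n{options}\nChoice:"
    )


def toy_policy(prompt: str, actions: List[str], rng: random.Random) -> int:
    """Stand-in for the trainable policy: usually follows the reflection."""
    if "to the right" in prompt and rng.random() < 0.9:
        return actions.index("right")
    return rng.randrange(len(actions))


def run_episode(env: ToyLineEnv, max_steps: int = 20, seed: int = 0):
    """Collect (prompt, chosen index, reward) tuples for one episode."""
    rng = random.Random(seed)
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        actions = env.valid_actions()
        reflection = frozen_reflection(obs)      # player 1: frozen
        prompt = build_prompt(obs, reflection, actions)
        choice = toy_policy(prompt, actions, rng)  # player 2: trainable
        obs, reward, done = env.step(actions[choice])
        trajectory.append((prompt, choice, reward))
        if done:
            break
    return trajectory
```

In an online RL setting, trajectories like the one returned by `run_episode(ToyLineEnv())` would feed a policy update that changes only the policy model, while the reflection model stays fixed.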

Paper Structure (57 sections, 5 equations, 16 figures, 3 tables, 1 algorithm)

Introduction
Motivations
Contributions
Key Techniques:
New Benchmark for Online RL Fine-Tuning.
Paper Overview.
Related Works
Language models (LMs).
LM agents and multi-agent collaborations.
Fine-tuning of LMs.
LMs for interactive decision-making.
Preliminaries
Notations.
Markov decision processes (MDPs).
Policy optimization for MDPs.
...and 42 more sections

Figures (16)

Figure 1: Reflect-RL Pipeline. Solid lines represent the forward pass for both data generation and inference. Agents (in circular nodes) are language models capable of generating reflections and making decisions. Red dashed lines represent the loss and gradient calculation during the training periods: the reflection agent is trained with SFT, while the policy agent is trained first with SFT and then with online RLFT. Detailed illustrations for each stage can be found in the section labeled sec:illustration_pipeline.
Figure 2: Training success rates of different training methods with GPT-2 XL in the pickup curriculum of the DangerousTaxi environment. We compare different RL methods over 5000 iterations of RLFT. SFT with 5000 iterations reaches only a 7% success rate, so only RL methods are shown.
Figure 3: Training success rate with and without negative examples in the AutoExplore setting, each assessed in a single run. Without negative examples, training is slower and less smooth.
Figure 4: Comparison of training success rates in the drop-off curriculum of the DangerousTaxi environment. The top two curves represent Reflect-RL; "w/ CL" means the experiment incorporates curriculum learning (CL) and is first trained with the pickup curriculum. The bottom two dashed curves represent online RL without reflection. All curves are from a single run.
Figure 5: Pipeline of Reflect-RL data generation.
...and 11 more figures
