Towards Better RL Training Data Utilization via Second-Order Rollout

Zhe Yang, Yudong Wang, Rang Li, Zhifang Sui

TL;DR

This work introduces the concept of second-order rollout (generating multiple critiques for a response), proposes a unified framework for jointly training generation and critique capabilities, and reports several findings on second-order rollout and critique training.

Abstract

Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities. However, vanilla RL mainly improves generation capability by training only on first-order rollout (generating multiple responses for a question). We argue that this approach fails to fully exploit the training data because it neglects critique capability training. To tackle this problem, we introduce the concept of second-order rollout (generating multiple critiques for a response) and propose a unified framework for jointly training generation and critique capabilities. Extensive experiments across various models and datasets demonstrate that our approach utilizes training data more effectively than vanilla RL and achieves better performance with the same training data. Additionally, we report several findings on second-order rollout and critique training, such as the importance of label balance in critique training and the noise problem of outcome-based rewards, which can be mitigated through sampling techniques. Our work offers a preliminary exploration of dynamic data augmentation and joint generation-critique training in RL, and may inform further advances in RL training.
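
To make the two rollout orders in the abstract concrete, here is a minimal Python sketch with a stub policy in place of an LLM. All function names (`generate_response`, `generate_critique`, `critique_reward`) are hypothetical and not taken from the paper. The sketch assumes each critique reduces to a binary verdict and that an outcome-based critique reward is 1 when the verdict agrees with the actual correctness of the response; the paper's reward definition may differ.

```python
import random

def generate_response(question: str, rng: random.Random) -> str:
    """Stub policy (hypothetical): answers 'a+b', sometimes incorrectly."""
    a, b = map(int, question.split("+"))
    return str(a + b + rng.choice([0, 0, 1]))

def generate_critique(question: str, response: str, rng: random.Random) -> bool:
    """Stub critic (hypothetical): verdict True means 'the response is correct'."""
    return rng.random() < 0.5

def first_order_rollout(question, n, rng):
    # First-order rollout: multiple responses for one question.
    return [generate_response(question, rng) for _ in range(n)]

def second_order_rollout(question, response, m, rng):
    # Second-order rollout: multiple critiques for one response.
    return [generate_critique(question, response, rng) for _ in range(m)]

def critique_reward(verdict: bool, response: str, gold: str) -> float:
    # Assumed outcome-based reward: 1.0 if the verdict matches the
    # actual correctness of the response, else 0.0.
    return 1.0 if verdict == (response == gold) else 0.0

rng = random.Random(0)
question, gold = "2+3", "5"
responses = first_order_rollout(question, 4, rng)
critiques = second_order_rollout(question, responses[0], 4, rng)
rewards = [critique_reward(v, responses[0], gold) for v in critiques]
```

In this toy setup a single question yields several responses, and each response can in turn yield several scored critiques, which is the sense in which second-order rollout extracts more training signal from the same data.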

Paper Structure

This paper contains 35 sections, 9 equations, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: A demonstration of first-order and second-order rollout. In first-order rollout, the policy model generates multiple responses for a question; in second-order rollout, it generates multiple critiques for a response.
  • Figure 2: Flowchart of a single training step in GC-RL. First, a batch of questions is sampled from the training data, and multiple responses are generated (first-order rollout). Then, without replacement, a batch of <question, response> pairs is sampled from the Question-Response Data Cache, and multiple critiques are generated (second-order rollout). These rollouts are combined and utilized jointly to update the policy model. In addition, the Question-Response Data Cache is maintained by processing the first-order rollout through a Data Filter and adding the filtered data into the cache.
  • Figure 3: A comparison of the performance of Qwen2.5-7B on Math-500 with and without the reward denoising strategy. In both the GC-RL and C-RL settings, reward denoising improves both generation and critique capabilities.
  • Figure 4: Performance of Qwen2.5-7B on Math-500 under GC-RL and C-RL settings with static and dynamic critique training data. Dynamic data outperforms static data in the GC-RL setting, while the opposite holds for C-RL.
  • Figure 5: A comparison of the critique performance of Qwen2.5-7B on Math-500 with different reward functions. Compared to the baseline $R(c)$, $R_w(c)$ leads to higher precision, while $R_r(c)$ yields higher recall.
  • ...and 4 more figures
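
The data flow of a single GC-RL training step, as described in the Figure 2 caption, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation. `StubPolicy`, `data_filter`, and `gc_rl_step` are hypothetical names, the filter criterion is a placeholder (the caption does not specify it), sampled pairs are assumed to be removed from the cache ("without replacement"), the cache is assumed to be updated after sampling, and the policy update itself is omitted.

```python
import random

class StubPolicy:
    """Placeholder for the LLM policy; both methods return dummy strings."""
    def __init__(self, rng):
        self.rng = rng

    def respond(self, question):
        return f"answer-{self.rng.randint(0, 2)}"

    def critique(self, question, response):
        return self.rng.choice(["correct", "incorrect"])

def data_filter(question, responses):
    # Hypothetical filter: keep one copy of each distinct response.
    return [(question, r) for r in sorted(set(responses))]

def gc_rl_step(questions, cache, policy, rng,
               n_questions=2, n_responses=4, n_pairs=2, n_critiques=3):
    # First-order rollout: multiple responses per sampled question.
    batch = rng.sample(questions, n_questions)
    first_order = [(q, [policy.respond(q) for _ in range(n_responses)])
                   for q in batch]

    # Sample <question, response> pairs from the cache without replacement
    # (assumption: sampled pairs are removed from the cache).
    k = min(n_pairs, len(cache))
    chosen = set(rng.sample(range(len(cache)), k))
    pairs = [cache[i] for i in sorted(chosen)]
    cache[:] = [p for i, p in enumerate(cache) if i not in chosen]

    # Second-order rollout: multiple critiques per sampled pair.
    second_order = [(pair, [policy.critique(*pair) for _ in range(n_critiques)])
                    for pair in pairs]

    # Cache maintenance: add the filtered first-order rollout.
    for q, rs in first_order:
        cache.extend(data_filter(q, rs))

    # Both rollouts would be combined for the policy update (omitted here).
    return first_order, second_order

rng = random.Random(0)
policy = StubPolicy(rng)
cache = []
questions = ["q1", "q2", "q3"]
step1 = gc_rl_step(questions, cache, policy, rng)
step2 = gc_rl_step(questions, cache, policy, rng)
```

Under these assumptions the cache is empty on the first step, so that step produces only first-order rollouts; from the second step onward, critiques are generated for responses from earlier steps.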