Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning

Zhiheng Xi, Jixuan Huang, Xin Guo, Boyang Hong, Dingwen Yang, Xiaoran Fan, Shuo Li, Zehui Chen, Junjie Ye, Siyu Yuan, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR

This paper tackles scalable oversight for language models by training critiquing models without relying on stronger supervision or test-time verifiers. It introduces Critique-RL, a two-stage online RL framework in which a critic learns to discriminate and provide helpful feedback, first via direct discriminability rewards and then via actor-refinement-based rewards with stabilizing regularization. Empirical results across multiple reasoning datasets (e.g., MATH, GSM8K, AQUA) and models (e.g., Qwen2.5-3B/7B) show substantial gains in both discrimination and final accuracy, with notable out-of-domain improvements and better compute efficiency at test time. The findings highlight the importance of decoupling and then jointly optimizing discriminability and helpfulness for scalable critique of model outputs.

Abstract

Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor's outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic's helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
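
The two reward stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Episode` fields, the function names, and the weight `beta` are hypothetical, and the Stage II regularization is approximated here by simply adding a weighted discriminability term to the refinement reward, which is an assumption about the general shape of the objective rather than the paper's exact formula.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One actor-critic interaction (all field names are hypothetical)."""
    original_correct: bool      # rule-based check of the actor's first response
    critic_says_correct: bool   # the critic's verdict on that response
    refined_correct: bool       # rule-based check of the actor's refined response


def stage1_reward(ep: Episode) -> float:
    """Direct, rule-based discriminability reward: 1 if the critic's verdict
    matches the true correctness of the original response, else 0."""
    return 1.0 if ep.critic_says_correct == ep.original_correct else 0.0


def stage2_reward(ep: Episode, beta: float = 0.5) -> float:
    """Indirect reward from the actor's refinement, plus a weighted
    discriminability term standing in for the regularization (assumed form)."""
    indirect = 1.0 if ep.refined_correct else 0.0
    return indirect + beta * stage1_reward(ep)
```

In this sketch a critic that wrongly approves an incorrect response and leaves it unrepaired receives zero reward in both stages, while a critic that flags the error and triggers a correct refinement is rewarded on both terms in Stage II.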

Paper Structure

This paper contains 47 sections, 11 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: Left: Critique-RL achieves better performance and discrimination on MATH. Right: Inference compute scaling for Critique-RL, with @2k and @3k indicating sampling amounts that are 2 times and 3 times the x-axis value, respectively. Critique-RL improves the performance ceiling and is more compute-efficient.
  • Figure 2: Left: A case illustrating the two-player actor-critic interaction, including the original response from the actor, the critique from the critic, and the refinement from the actor. Right: Overview of our method and its comparison with baseline RL. The snowflake icon on the actor indicates that it is fixed, while the fire icon on the critic indicates that it will be updated. Our method employs a two-stage RL process. It optimizes the discriminability of critique models in Stage I, and optimizes helpfulness while maintaining discriminability in Stage II.
  • Figure 3: Training dynamics of preliminary experiments. "Acc@Dis Originally Correct" and "Acc@Dis Originally Incorrect" refer to the discrimination accuracy on originally correct and originally incorrect responses, respectively. Baselines that use indirect reward signals to optimize helpfulness tend to exhibit overly conservative or overly aggressive behavior because discriminability is not well optimized. In contrast, our Critique-RL optimizes discriminability in Stage I, and optimizes helpfulness while maintaining discriminability in Stage II, achieving better results in $\textnormal{Acc@Refine}$, $\boldsymbol{\Delta^{c\to i}}$ and $\boldsymbol{\Delta^{i\to c}}$.
  • Figure 4: Results of critique-refinement of Critique-RL using Qwen2.5-3B.
  • Figure 5: Performance with and without the oracle verifier. When the oracle verifier is available, the model no longer needs to discriminate and only needs to provide useful feedback. This allows us to evaluate the model's helpfulness more accurately.
  • ...and 4 more figures
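
The quantities named in the Figure 3 caption can be computed from per-example outcomes roughly as follows. This is a minimal sketch under stated assumptions: the record layout and the function name `critique_metrics` are hypothetical, and normalizing the two transition rates by the total number of examples (rather than by the size of each subset) is an assumption, not something the captions specify.

```python
def critique_metrics(records):
    """Compute critique metrics from per-example outcomes.

    records: list of (original_correct, critic_says_correct, refined_correct)
    boolean triples (a hypothetical layout chosen for this sketch).
    """
    n = len(records)
    if n == 0:
        raise ValueError("records must be non-empty")

    def dis_acc(subset):
        # Fraction of examples where the critic's verdict matches the truth.
        if not subset:
            return None  # undefined when the subset is empty
        return sum(o == c for o, c, _ in subset) / len(subset)

    originally_correct = [r for r in records if r[0]]
    originally_incorrect = [r for r in records if not r[0]]

    return {
        "acc_dis": dis_acc(records),
        "acc_dis_originally_correct": dis_acc(originally_correct),
        "acc_dis_originally_incorrect": dis_acc(originally_incorrect),
        # Accuracy of the refined responses.
        "acc_refine": sum(r for _, _, r in records) / n,
        # Correct -> incorrect after refinement (lower is better).
        "delta_c_to_i": sum(o and not r for o, _, r in records) / n,
        # Incorrect -> correct after refinement (higher is better).
        "delta_i_to_c": sum((not o) and r for o, _, r in records) / n,
    }
```

An overly aggressive critic would show up here as low discrimination accuracy on originally correct responses and a high correct-to-incorrect rate, while an overly conservative one would show low discrimination accuracy on originally incorrect responses and a low incorrect-to-correct rate.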