Think Twice: Branch-and-Rethink Reasoning Reward Model

Yizhu Jiao, Jiaqi Zeng, Julien Veron Vialard, Oleksii Kuchaiev, Jiawei Han, Olivier Delalleau

TL;DR

This work targets judgment diffusion in reward models by introducing BR-RM, a two-turn framework that first adaptively selects a few instance-critical evaluation dimensions and then performs a conditioned second-pass rethinking. Trained with GRPO on structured two-turn traces and a binary outcome reward, BR-RM integrates smoothly with standard RLHF pipelines. Empirical results on RewardBench, RM-Bench, and RMB show state-of-the-art performance and robust improvements over baselines, highlighting the value of targeted, second-look reasoning for safer and more reliable alignment. The approach offers practical benefits for scalable deployment and provides insights into how focused evaluation improves sensitivity to subtle errors across domains like factuality and safety.

Abstract

Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains.
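The training signal described above (a binary outcome reward gated by strict format checks, optimized with GRPO-style updates) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the `<branch>`, `<rethink>`, and `<verdict>` tags, the function names, the choice to score malformed traces as 0.0, and the use of population standard deviation for group normalization are not taken from the paper.

```python
import re
from statistics import mean, pstdev

# Hypothetical trace format: these tag names are illustrative assumptions,
# not the paper's actual prompt template.
TURN1_PATTERN = re.compile(r"<branch>(.+?)</branch>", re.DOTALL)
TURN2_PATTERN = re.compile(
    r"<rethink>(.+?)</rethink>\s*<verdict>([AB])</verdict>", re.DOTALL
)


def outcome_reward(turn1: str, turn2: str, preferred: str) -> float:
    """Return 1.0 only if both turns are well-formed and the verdict is correct.

    Assumption: a trace that fails the format check is scored 0.0.
    """
    branch = TURN1_PATTERN.fullmatch(turn1.strip())
    rethink = TURN2_PATTERN.fullmatch(turn2.strip())
    if branch is None or rethink is None:
        return 0.0  # strict format check failed
    verdict = rethink.group(2)
    return 1.0 if verdict == preferred else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one sampled group (the GRPO-style baseline).

    Assumption: population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # Every rollout got the same reward, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    turn1 = "<branch>factuality, safety</branch>"
    turn2 = "<rethink>Response A cites a wrong date.</rethink><verdict>B</verdict>"
    rewards = [
        outcome_reward(turn1, turn2, preferred="B"),      # correct verdict
        outcome_reward(turn1, turn2, preferred="A"),      # wrong verdict
        outcome_reward("no tags", turn2, preferred="B"),  # malformed Turn 1
        outcome_reward(turn1, turn2, preferred="B"),      # correct verdict
    ]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because the reward depends only on the final verdict and on whether the trace is well-formed, nothing in this sketch supervises the intermediate reasoning directly; under that assumption, the choice of dimensions in Turn 1 is shaped only through its effect on the final judgment.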

Paper Structure

This paper contains 31 sections, 6 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparing token allocation between our method and a recent ReasonRM (guo2025rrm) on two subsets from RM-Bench. Our method adaptively focuses its generative analysis on a few critical dimensions for each task (e.g., Information Accuracy for chat), while the baseline spreads its tokens broadly across many criteria. More details in Appendix \ref{sec:appendix_prelim}.
  • Figure 2: Illustration of our proposed method, comparing with GenRM and ReasonRM. Our Branch-and-Rethink Reasoning Reward Model first performs Turn 1: Adaptive Branching, where it selects a few critical dimensions (e.g., Logical Reasoning, Computation Precision) to focus its evaluation and detect specific issues. This focused analysis then informs Turn 2: Branch-Conditioned Rethinking, where the model conducts a deeper, issue-driven re-thinking to arrive at a final reward judgment, which is then used for reinforcement learning.
  • Figure 3: Comparing token allocation between our method and a recent ReasonRM (guo2025rrm) on four subsets from RM-Bench. Our method adaptively focuses its generative analysis on a few critical dimensions for each task (e.g., Information Accuracy for chat), while the baseline spreads its tokens broadly across many criteria.
  • Figure 4: Prompt for adaptive branching.
  • Figure 5: Prompt for branch-conditioned rethinking.
  • ...and 1 more figure