Think Twice: Branch-and-Rethink Reasoning Reward Model
Yizhu Jiao, Jiaqi Zeng, Julien Veron Vialard, Oleksii Kuchaiev, Jiawei Han, Olivier Delalleau
TL;DR
This work targets judgment diffusion in reward models, the dilution of focus that occurs when a single scoring pass spreads attention across many evaluation criteria. It introduces BR-RM, a two-turn framework that first adaptively selects a few instance-critical evaluation dimensions and then performs a second-pass rethinking conditioned on them. Trained with GRPO on structured two-turn traces and a binary outcome reward, BR-RM is compatible with standard RLHF pipelines. On RewardBench, RM-Bench, and RMB it achieves state-of-the-art performance and improves over baselines, which points to the value of targeted, second-look reasoning for more reliable alignment. The results also suggest that focused evaluation improves sensitivity to subtle errors in dimensions such as factuality and safety, while the approach remains practical to deploy at scale.
Abstract
Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains.
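As a rough illustration of the "binary outcome reward with strict format checks" mentioned above, the following minimal sketch scores one two-turn trace. It is not the authors' implementation: the `<branch>` and `<verdict>` tag names, the cap of three dimensions, the A/B verdict labels, and the choice to give malformed traces a reward of 0 are all assumptions made for the example.

```python
import re
from typing import List, Optional

# Hypothetical trace format: these tag names are illustrative assumptions,
# not the format used in the paper.
BRANCH_RE = re.compile(r"<branch>(.*?)</branch>", re.DOTALL)
VERDICT_RE = re.compile(r"<verdict>\s*([AB])\s*</verdict>")
MAX_DIMENSIONS = 3  # assumed cap on the "small set" of dimensions


def parse_dimensions(turn1: str) -> Optional[List[str]]:
    """Return the dimensions selected in turn 1, or None if the format is invalid."""
    matches = BRANCH_RE.findall(turn1)
    if len(matches) != 1:  # strict: exactly one branch block
        return None
    dims = [d.strip() for d in matches[0].split(",") if d.strip()]
    if not 1 <= len(dims) <= MAX_DIMENSIONS:
        return None
    return dims


def parse_verdict(turn2: str) -> Optional[str]:
    """Return 'A' or 'B' from turn 2, or None if the format is invalid."""
    matches = VERDICT_RE.findall(turn2)
    return matches[0] if len(matches) == 1 else None  # strict: exactly one verdict


def outcome_reward(turn1: str, turn2: str, gold: str) -> float:
    """Binary reward: 1.0 only if both turns are well formed and the verdict is correct."""
    if parse_dimensions(turn1) is None:
        return 0.0  # assumed handling of a malformed turn 1
    verdict = parse_verdict(turn2)
    if verdict is None:
        return 0.0  # assumed handling of a malformed turn 2
    return 1.0 if verdict == gold else 0.0


turn1 = "<branch>factuality, safety</branch> Hypothesis: response B misstates a date."
turn2 = "Rechecked the date claim in both responses. <verdict>A</verdict>"
reward = outcome_reward(turn1, turn2, gold="A")
```

In a GRPO-style setup, a scalar like this would be computed for each sampled trace in a group and then normalized within the group to form advantages. A format failure could also be given a negative reward instead of 0; that design choice is not specified in the text above.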
