Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models

Jiayun Wu; Peixu Hou; Shan Qu; Peng Zhang; Ning Gu; Tun Lu

Abstract

Reward models (RMs) are critical for aligning Large Language Models via Reinforcement Learning from Human Feedback (RLHF). While Generative Reward Models (GRMs) achieve superior accuracy through chain-of-thought (CoT) reasoning, they incur substantial computational costs. Conversely, Scalar Reward Models (SRMs) offer efficiency but suffer from limited performance and adaptability in complex scenarios. We introduce Fast-Slow Thinking Reward Models (F/S-RM), a hybrid RM architecture inspired by Dual Process Theory. It trains a single model to integrate two distinct reward paradigms: first-token prediction as a scalar score (fast thinking) and CoT-based judgment (slow thinking), regulated by a dual-confidence activation mechanism that determines when to activate slow thinking. F/S-RM achieves a 1.2% relative performance improvement over state-of-the-art models while reducing token consumption by 20.8%. Code and data will be publicly available.
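The control flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a pairwise setting where the first generated token is a verdict token (`"A"` or `"B"`, hypothetical names), it treats the two confidence values `c_i` and `c_t` as opaque inputs because the abstract does not define them, and it assumes the fast verdict is kept only when both confidences clear their thresholds. The function names, the thresholds `tau_i` and `tau_t`, and the `slow_judge` callback are all hypothetical.

```python
import math
from typing import Callable, Dict, Tuple


def first_token_score(logits: Dict[str, float], pos: str = "A", neg: str = "B") -> float:
    """Scalar reward: softmax probability of `pos` over the two verdict tokens."""
    m = max(logits[pos], logits[neg])  # subtract the max for numerical stability
    e_pos = math.exp(logits[pos] - m)
    e_neg = math.exp(logits[neg] - m)
    return e_pos / (e_pos + e_neg)


def hybrid_judge(
    logits: Dict[str, float],
    c_i: float,
    c_t: float,
    tau_i: float,
    tau_t: float,
    slow_judge: Callable[[], str],
) -> Tuple[str, str]:
    """Return (verdict, mode), where mode is "fast" or "slow".

    Assumption: the fast verdict is accepted only if BOTH confidences
    reach their thresholds; otherwise the costly CoT judge is invoked.
    """
    score = first_token_score(logits)
    fast_verdict = "A" if score >= 0.5 else "B"
    if c_i >= tau_i and c_t >= tau_t:
        return fast_verdict, "fast"
    # Slow thinking is only computed when the gate rejects the fast path.
    return slow_judge(), "slow"
```

Passing `slow_judge` as a callback keeps the expensive CoT generation lazy, which is where any token savings in such a design would come from.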

Paper Structure (26 sections, 13 equations, 6 figures, 8 tables)

Introduction
Preliminary
Fast-Slow Thinking RM
Fast Thinking as First-Token Prediction
RL for Slow Thinking
Dual-Confidence Activation Mechanism
Experiments
Main Results
Ablation Study
Efficacy of Dual-Confidence Activation
Domain-Specific Analysis
Related Work
Scalar Reward Models
Generative Reward Models with Reasoning
Conclusion
...and 11 more sections

Figures (6)

Figure 1: Structural and Performance Comparison of SRMs and GRMs. (a) Architectural designs and output formats—SRMs produce scalar scores while GRMs generate reasoning chains. (b) Performance across three difficulty levels on RM-Bench. GRMs excel on Hard cases while SRMs dominate Easy cases, revealing complementary strengths.
Figure 2: Architecture and training workflow of F/S-RM. (a) Adaptive reasoning task. (b) Reward signal generation modeled as an adaptive reasoning chain. (c) Two-stage training pipeline for optimizing fast-thinking judgment and slow-thinking CoT reasoning.
Figure 3: Dual-Confidence visualization on RewardBench (Qwen3-8B). Vertical axis ($Z$): hybrid accuracy; surface color: token savings percentage; horizontal axes: $C^I$ (left) and $C^T$ (right). Deeper yellow indicates higher savings; height indicates better performance.
Figure 4: Performance-efficiency trade-off across threshold settings. Thresholds are set at minimum, 25th percentile, mean (main results), and 75th percentile of $C^I$ and $C^T$ from correctly predicted training samples. Left y-axis: hybrid accuracy; right y-axis: token savings.
Figure 5: Performance breakdown across domains and difficulty levels on RM-Bench. Left y-axis (blue bars): accuracy under Fast, Slow, and Hybrid modes. Right y-axis (green bars): Fast Rate and Token Saving percentages for Hybrid mode.
...and 1 more figure
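The threshold settings named in the Figure 4 caption (minimum, 25th percentile, mean, 75th percentile) can be computed from a list of confidence values with the standard library. The sketch below is illustrative only: the function name and the input list are hypothetical, the same routine would be applied separately to the $C^I$ and $C^T$ values of correctly predicted training samples, and the choice of the "inclusive" quantile method is an assumption, since the caption does not say how percentiles are interpolated.

```python
import statistics
from typing import Dict, List


def threshold_candidates(confidences: List[float]) -> Dict[str, float]:
    """Candidate thresholds from the confidences of correctly predicted samples."""
    # quantiles(n=4) returns the three quartile cut points (25th, 50th, 75th).
    q1, _median, q3 = statistics.quantiles(confidences, n=4, method="inclusive")
    return {
        "min": min(confidences),
        "p25": q1,
        "mean": statistics.fmean(confidences),
        "p75": q3,
    }
```

In a gate that accepts the fast verdict only when confidence clears the threshold, a lower threshold such as the minimum would route more samples to fast thinking, while the 75th percentile would send more of them to slow thinking.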
