Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models

Jiayun Wu, Peixu Hou, Shan Qu, Peng Zhang, Ning Gu, Tun Lu

Abstract

Reward models (RMs) are critical for aligning Large Language Models via Reinforcement Learning from Human Feedback (RLHF). While Generative Reward Models (GRMs) achieve superior accuracy through chain-of-thought (CoT) reasoning, they incur substantial computational costs. Conversely, Scalar Reward Models (SRMs) offer efficiency but suffer from limited performance and adaptability in complex scenarios. We introduce Fast-Slow Thinking Reward Models (F/S-RM), a hybrid RM architecture inspired by Dual Process Theory. It trains a single model to integrate two distinct reward paradigms: first-token prediction as a scalar score (fast thinking) and CoT-based judgment (slow thinking), regulated by a dual-confidence activation mechanism that determines when to activate slow thinking. F/S-RM achieves a 1.2% relative performance improvement over state-of-the-art models while reducing token consumption by 20.8%. Code and data will be publicly available.
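
The routing idea in the abstract can be sketched as a small gate. This is only an illustration under stated assumptions: the excerpt does not define the two confidence signals or the activation rule, so `conf_i`, `conf_t`, the thresholds `tau_i` and `tau_t`, the "both must clear their threshold" rule, and the one-token cost of a fast judgment are all hypothetical placeholders rather than the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Judgment:
    preferred: str    # which response is preferred, e.g. "A" or "B"
    mode: str         # "fast" (first-token score) or "slow" (CoT judgment)
    tokens_used: int  # generated tokens spent on this judgment


def route(
    fast_choice: str,
    conf_i: float,
    conf_t: float,
    slow_judge: Callable[[], Tuple[str, int]],
    tau_i: float,
    tau_t: float,
) -> Judgment:
    """Return the fast judgment when confident, otherwise run slow thinking.

    Assumption: the fast judgment is trusted only when BOTH confidence
    signals reach their thresholds. The paper's actual rule may differ.
    """
    if conf_i >= tau_i and conf_t >= tau_t:
        # Assumption: a fast judgment costs a single generated token.
        return Judgment(fast_choice, "fast", 1)
    # slow_judge is a hypothetical callable that generates a CoT and
    # returns (preferred_response, number_of_generated_tokens).
    choice, cot_tokens = slow_judge()
    return Judgment(choice, "slow", cot_tokens)
```

Passing the slow path as a callable keeps the expensive generation lazy: it runs only when the gate rejects the fast judgment, which is where any token saving would come from.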

Paper Structure (26 sections, 13 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Structural and Performance Comparison of SRMs and GRMs. (a) Architectural designs and output formats—SRMs produce scalar scores while GRMs generate reasoning chains. (b) Performance across three difficulty levels on RM-Bench. GRMs excel on Hard cases while SRMs dominate Easy cases, revealing complementary strengths.
  • Figure 2: Architecture and training workflow of F/S-RM. (a) Adaptive reasoning task. (b) Reward signal generation modeled as an adaptive reasoning chain. (c) Two-stage training pipeline for optimizing fast-thinking judgment and slow-thinking CoT reasoning.
  • Figure 3: Dual-Confidence visualization on RewardBench (Qwen3-8B). Vertical axis ($Z$): hybrid accuracy; surface color: token savings percentage; horizontal axes: $C^I$ (left) and $C^T$ (right). Deeper yellow indicates higher savings; height indicates better performance.
  • Figure 4: Performance-efficiency trade-off across threshold settings. Thresholds are set at minimum, 25th percentile, mean (main results), and 75th percentile of $C^I$ and $C^T$ from correctly predicted training samples. Left y-axis: hybrid accuracy; right y-axis: token savings.
  • Figure 5: Performance breakdown across domains and difficulty levels on RM-Bench. Left y-axis (blue bars): accuracy under Fast, Slow, and Hybrid modes. Right y-axis (green bars): Fast Rate and Token Saving percentages for Hybrid mode.
  • ...and 1 more figure
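
The threshold settings named in the Figure 4 caption (minimum, 25th percentile, mean, 75th percentile of the confidences of correctly predicted training samples) and the quantities plotted in Figures 4 and 5 (hybrid accuracy, fast rate, token savings) can be illustrated with a short sketch. The sample layout, the gate rule, the one-token cost of a fast judgment, and the definition of token savings relative to an always-slow baseline are assumptions made for illustration; the captions do not specify them.

```python
import statistics
from typing import Dict, List, Tuple

# Hypothetical per-sample record:
# (conf_i, conf_t, fast_is_correct, slow_is_correct, slow_tokens)
Sample = Tuple[float, float, bool, bool, int]


def candidate_thresholds(confidences: List[float]) -> Dict[str, float]:
    """Four candidate thresholds from one confidence signal.

    `confidences` should hold values from correctly predicted training
    samples, as the Figure 4 caption describes.
    """
    q1, _median, q3 = statistics.quantiles(confidences, n=4, method="inclusive")
    return {
        "min": min(confidences),
        "p25": q1,
        "mean": statistics.mean(confidences),
        "p75": q3,
    }


def evaluate(samples: List[Sample], tau_i: float, tau_t: float) -> Dict[str, float]:
    """Hybrid accuracy, fast rate, and token savings for one threshold pair."""
    correct = fast_count = hybrid_tokens = slow_tokens_total = 0
    for conf_i, conf_t, fast_ok, slow_ok, slow_tokens in samples:
        slow_tokens_total += slow_tokens
        # Assumed gate: answer fast only if both confidences clear thresholds.
        if conf_i >= tau_i and conf_t >= tau_t:
            fast_count += 1
            hybrid_tokens += 1  # assumption: a fast judgment costs one token
            correct += fast_ok
        else:
            hybrid_tokens += slow_tokens
            correct += slow_ok
    n = len(samples)
    return {
        "hybrid_accuracy": correct / n,
        "fast_rate": fast_count / n,
        # Assumed definition: savings relative to always running slow thinking.
        "token_savings": 1.0 - hybrid_tokens / slow_tokens_total,
    }
```

Sweeping `evaluate` over the candidate thresholds for each signal would trace the kind of trade-off the caption describes: under this gate, higher thresholds send more samples to slow thinking and so reduce token savings.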