Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Sharon Li, Jason E Weston, Ping Yu

TL;DR

HERO tackles the brittleness of purely binary verifiers by integrating dense reward-model feedback with verifier signals in a structured framework. It introduces stratified normalization to anchor reward-model scores within verifier-defined groups and variance-aware weighting to emphasize informative, hard prompts, enabling stable learning across easy, hard, and mixed data regimes. Empirical results on multiple backbones and math-reasoning benchmarks show HERO consistently outperforms both RM-only and verifier-only baselines, with notable gains on hard-to-verify tasks. The approach preserves verifier stability while leveraging nuanced RM feedback, offering a practical path toward more reliable and scalable reasoning in LLMs.

Abstract

Post-training large language models (LLMs) for reasoning increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle: many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
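
The two mechanisms named in the abstract can be illustrated with a small sketch. The abstract does not give formulas, so everything concrete here is an assumption made for illustration: the min-max rescaling, the range parameters `alpha` and `beta`, the sigmoid form of the weight, and the names `stratified_rewards` and `difficulty_weight` are hypothetical and may differ from the paper's actual definitions. The only properties taken from the text are that reward-model scores are bounded within verifier-defined groups and that prompts with more varied scores receive more emphasis.

```python
import math
from statistics import pstdev


def stratified_rewards(verifier, rm_scores, alpha=0.1, beta=0.1):
    """Rescale reward-model scores separately inside each verifier group.

    Assumed mapping (illustrative, not taken from the paper):
      verifier == 0 -> scores land in [-alpha, alpha]
      verifier == 1 -> scores land in [1 - beta, 1 + beta]
    With alpha + beta < 1, every verified-correct response outranks every
    verified-incorrect one, while the RM still orders responses in a group.
    """
    rewards = [0.0] * len(rm_scores)
    for label, center, half_width in ((0, 0.0, alpha), (1, 1.0, beta)):
        idx = [i for i, v in enumerate(verifier) if v == label]
        if not idx:
            continue
        lo = min(rm_scores[i] for i in idx)
        hi = max(rm_scores[i] for i in idx)
        for i in idx:
            # Min-max position in [0, 1]; a group with no spread maps to 0.5,
            # i.e. the center of its range.
            unit = 0.5 if hi == lo else (rm_scores[i] - lo) / (hi - lo)
            rewards[i] = center - half_width + 2.0 * half_width * unit
    return rewards


def difficulty_weight(rm_scores, w_min=0.5, w_max=2.0, k=5.0, sigma_ref=0.5):
    """Assumed variance-aware weight: a sigmoid of the score spread.

    Prompts whose sampled responses receive widely varying RM scores get a
    weight near w_max; prompts with near-identical scores get one near w_min.
    """
    sigma = pstdev(rm_scores)
    return w_min + (w_max - w_min) / (1.0 + math.exp(-k * (sigma - sigma_ref)))


# Four sampled responses to one prompt: verifier verdicts and raw RM scores.
verifier = [1, 0, 1, 0]
rm_scores = [2.0, -1.0, 0.5, 1.5]

rewards = stratified_rewards(verifier, rm_scores)
weight = difficulty_weight(rm_scores)
weighted = [weight * r for r in rewards]
```

Note that the third response has a lower raw RM score (0.5) than the fourth (1.5), yet it ends up with the higher reward because the verifier marked it correct: the verifier decides the group, and the reward model only ranks responses inside it.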

Paper Structure

This paper contains 50 sections, 5 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Comparison of reward signals from different supervision sources. Reward Models (a) provide smooth but sometimes misaligned scores, occasionally assigning high values to incorrect responses and low values to correct ones. Rule-based rewards (b) enforce a strict binary (0–1) boundary: they rarely give false positives, but due to their stringent criteria, many predictions that are actually correct receive a reward of 0 simply because they fail to pass the rule. HERO (c) uses the rule as a gate, which significantly reduces false positives. At the same time, by integrating the reward model signal, HERO assigns higher reward scores to those cases that would have been false negatives under (b), resulting in more accurate and informative supervision.
  • Figure 2: (a) Impact of using positive and negative dense ranges. Dense negative rewards contribute more to stable learning than positive samples. (b) Effect of varying reward ranges under different training regimes. Smaller ranges perform best on verifiable tasks, while larger ranges benefit mixed settings by providing denser feedback.
  • Figure 3: GPT-4o filter prompt for TextBookReasoning.
  • Figure 4: Prompt Template for hard-to-verify tasks evaluation via GPT-4o.
  • Figure 5: Reward model qualification ability on mixed groups: (a) distribution of AUROC scores, (b) AUROC box plot, (c) cumulative distribution of AUROC, and (d) AUROC performance categories.
  • ...and 1 more figure