LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling

Zecheng Tang, Baibei Ji, Quantong Qiu, Haitian Wang, Xiaobo Liang, Juntao Li, Min Zhang

TL;DR

This paper identifies a critical gap in reward modeling for long-context scenarios and introduces Long-RewardBench, a benchmark covering up to 128K tokens and two task formats (Pair and Best-of-N). It proposes a general multi-stage training strategy (LongRM) with Short-to-Long dataset synthesis and consistency-aligned RL (LOGO/DPO) to scale arbitrary models into robust long-context reward models while preserving short-context performance. Empirical results show that 8B LongRMs can outperform much larger baselines (up to 70B) and match proprietary Gemini 2.5 Pro on long-context evaluation, with additional gains in practical SFT-style settings through self-distillation guided by LongRM. The work also extends the approach to discriminative reward models and provides ablations demonstrating generalization and real-world utility, highlighting long-context grounding as a tractable objective for reward modeling.

Abstract

Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. As real-world applications increasingly involve long history trajectories, e.g., in LLM agents, it becomes indispensable to evaluate whether a model's responses are not only high-quality but also grounded in and consistent with the provided context. Yet current RMs remain confined to short-context settings and primarily focus on response-level attributes (e.g., safety or helpfulness), while largely neglecting the critical dimension of long context-response consistency. In this work, we introduce Long-RewardBench, a benchmark specifically designed for long-context RM evaluation, featuring both Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that even state-of-the-art generative RMs exhibit significant fragility in long-context scenarios, failing to maintain context-aware preference judgments. Motivated by an analysis of the failure patterns observed in model outputs, we propose a general multi-stage training strategy that effectively scales arbitrary models into robust long-context RMs (LongRMs). Experiments show that our approach not only substantially improves performance on long-context evaluation but also preserves strong short-context capability. Notably, our 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.
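To make the two task formats concrete, the following is a minimal sketch of how a Pairwise Comparison or Best-of-N example might be scored against a judge. It is not the paper's evaluation code: `Example`, `Judge`, `accuracy`, and the toy `overlap_judge` are hypothetical names introduced here for illustration, and the benchmark's actual Best-of-N metric may differ from simple top-1 accuracy.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Example:
    """One benchmark item (hypothetical schema, not the paper's data format)."""
    context: str          # long context the responses must be grounded in
    question: str
    responses: List[str]  # 2 candidates for Pair, N candidates for Best-of-N
    best_index: int       # index of the reference-preferred response


# Assumed judge interface: returns the index of the response it prefers.
Judge = Callable[[str, str, Sequence[str]], int]


def accuracy(examples: Sequence[Example], judge: Judge) -> float:
    """Fraction of examples where the judge picks the reference-preferred response."""
    if not examples:
        return 0.0
    correct = sum(
        judge(ex.context, ex.question, ex.responses) == ex.best_index
        for ex in examples
    )
    return correct / len(examples)


def overlap_judge(context: str, question: str, responses: Sequence[str]) -> int:
    """Toy stand-in for a reward model: prefers the response that shares
    the most whitespace-separated words with the context."""
    ctx_words = set(context.lower().split())
    scores = [len(ctx_words & set(r.lower().split())) for r in responses]
    return scores.index(max(scores))


# Pair format: two candidate responses.
pair = Example(
    context="The bridge opened in 1932 and spans the harbour.",
    question="When did the bridge open?",
    responses=["It opened in 1932.", "It opened around 1950."],
    best_index=0,
)

# Best-of-N format: N candidate responses (here N = 3).
bon = Example(
    context="Alice handed the report to Bob on Friday.",
    question="Who received the report?",
    responses=["Carol received it.", "Bob received the report on Friday.", "Nobody did."],
    best_index=1,
)

print(accuracy([pair, bon], overlap_judge))
```

In this sketch the Pair task is simply the N = 2 case of the same interface. A real long-context RM would replace `overlap_judge` and would receive contexts far longer than these one-sentence examples.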

Paper Structure

This paper contains 61 sections, 3 equations, 17 figures, 8 tables.

Figures (17)

  • Figure 1: Construction and task format of Long-RewardBench. Specifically, Long-RewardBench contains 6 tasks and 2 task formats, i.e., Pairwise Comparison (Pair) and Best-of-N (BoN).
  • Figure 2: Evaluation results of existing GenRMs on Long-RewardBench. For ease of analysis, we evaluate RMs on the Pair task under 2 scenarios: (a) Single-document QA and (b) Synthetic long-form reasoning. We report the evaluation accuracy across different context length intervals.
  • Figure 3: Results of conventional context scaling methods on Long-RewardBench and RewardBench.
  • Figure 4: Illustration of two prevalent failure patterns of GenRMs on Long-RewardBench.
  • Figure 5: Illustration of the multi-stage training strategy of LongRM (top row) and the corresponding data synthesis process for each stage (bottom row).
  • ...and 12 more figures