Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

Chris Samarinas, Haw-Shiuan Chang, Hamed Zamani

TL;DR

The paper proposes SLATE, a framework built on two complementary ideas: truncated step-level sampling, which generates k trajectories that share a common prefix and differ only at the next step, and dense LLM-as-judge rewards, which replace heuristic scoring with a capable LLM evaluator that assesses the quality of each reasoning step, search query, and answer.

Abstract

Training large language models to reason with search engines via reinforcement learning is hindered by a fundamental credit assignment problem: existing methods such as Search-R1 provide only a sparse outcome reward after an entire multi-step trajectory, making it infeasible to attribute success or failure to individual reasoning and retrieval decisions. Process-reward methods like StepSearch alleviate this by introducing step-level supervision, but rely on heuristic rewards such as TF-IDF overlap with gold documents, and still sample k complete trajectories per example, retaining high gradient variance. We propose SLATE, a framework built on two complementary ideas: (1) truncated step-level sampling, which generates k trajectories that share a common prefix and differ only at the next step, and (2) dense LLM-as-judge rewards, which replace heuristic scoring with a capable LLM evaluator that assesses the quality of each reasoning step, search query, and answer, providing richer and more reliable supervision. We theoretically prove that under the same dense reward structure, truncated sampling reduces the variance of advantage estimates by up to a factor of T compared to full-trajectory sampling for T-step trajectories, yielding lower-variance, better-targeted policy gradients. Experiments on seven QA benchmarks confirm that SLATE consistently outperforms both sparse-reward and process-reward baselines, with the largest gains on harder multi-hop tasks and smaller models.

Paper Structure

This paper contains 46 sections, 2 theorems, 28 equations, 2 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Let $\tau = (a_1, \ldots, a_T)$ be a $T$-step trajectory. Suppose the trajectory-level reward decomposes additively as $R(\tau) = \sum_{t=1}^T r_t(a_t, \tau_{<t})$, where $r_t$ is the step-$t$ reward. Assume the following conditions hold: 1) Non-negative future covariance: for each step $t$ and any [...] where the left side is the expected (over prefixes) per-sample variance in the truncated estimator [...]
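
The factor-of-$T$ claim can be illustrated with a toy simulation. This is a minimal sketch under assumptions that are not taken from the paper: step rewards are independent standard normal draws, and the truncated estimator scores only the step-$t$ reward of each sample. The function names are hypothetical. In this idealized case the full-trajectory return has variance close to $T$ times that of a single step, which corresponds to the "up to a factor of $T$" bound.

```python
import random
import statistics


def step_reward(rng):
    # Hypothetical step reward: i.i.d. standard normal (an assumption for illustration).
    return rng.gauss(0.0, 1.0)


def full_trajectory_returns(T, k, rng):
    # Every sample redraws all T steps, so each step contributes variance to R(tau).
    return [sum(step_reward(rng) for _ in range(T)) for _ in range(k)]


def truncated_step_rewards(k, rng):
    # The prefix is shared, so only the step-t reward differs across the k samples.
    return [step_reward(rng) for _ in range(k)]


rng = random.Random(0)
T, k = 8, 20000
var_full = statistics.variance(full_trajectory_returns(T, k, rng))
var_trunc = statistics.variance(truncated_step_rewards(k, rng))
ratio = var_full / var_trunc
print(f"full: {var_full:.2f}  truncated: {var_trunc:.2f}  ratio: {ratio:.2f}")
```

With correlated or unequal step rewards the ratio would differ from $T$; the sketch only shows the best case suggested by the additive decomposition.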

Figures (2)

  • Figure 1: Comparison of GRPO (with full trajectory sampling) and our truncated step-level sampling. By fixing the prefix $\tau_{<t}$, all variation in the sampled group is localized to step $t$.
  • Figure 2: Training dynamics comparison on Qwen2.5-7B-Base. SLATE converges faster and achieves a higher, more stable reward than Search-R1/GRPO and StepSearch/StePPO.
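
The idea in Figure 1 can be sketched as a group-relative advantage computed over candidates that share one prefix. This is a minimal sketch, assuming GRPO-style normalization (subtract the group mean, divide by the group standard deviation); the excerpt here does not show the paper's exact advantage formula, and `group_advantages` and the judge scores are hypothetical.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    # Normalize rewards within one group: zero mean, roughly unit scale.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical judge scores for k = 4 candidate next steps sampled from one shared prefix.
judge_scores = [0.2, 0.9, 0.5, 0.4]
advantages = group_advantages(judge_scores)

# Because the prefix is identical, differences in advantage reflect only the step-t choice.
best = max(range(len(advantages)), key=lambda i: advantages[i])
print(best, [round(a, 2) for a in advantages])
```

Under full-trajectory sampling the same normalization would instead be applied to whole-trajectory returns, so differences between samples would mix the effects of every step.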

Theorems & Definitions (7)

  • Theorem 1: Variance Reduction via Truncated Sampling
  • proof : Proof Sketch
  • Remark 1: Credit Assignment
  • proof
  • Proposition 2: Sample Efficiency
  • proof
  • Remark 2: Bias-Variance Trade-off