Hybrid Reward Normalization for Process-supervised Non-verifiable Agentic Tasks

Peiran Xu, Zhuohao Li, Xiaoying Xing, Guannan Zhang, Debiao Li, Kunyu Shi

TL;DR

This paper tackles the challenge of training LLM-based agents to perform non-verifiable, multi-turn tool-use tasks by introducing Principle Process Reward (PPR), a hybrid reinforcement learning framework that couples principled process evaluation with outcome verification. A dedicated Principle Process Reward Model (PPRM) grounds step-level judgments in explicit principles, while Reward Normalization (ReNorm) calibrates process and outcome signals to stabilize learning over long trajectories. Empirical results show state-of-the-art performance on in-domain and out-of-domain QA tasks, and a new NVProcessBench benchmark demonstrates the effectiveness of process-based rewards in non-verifiable settings. The approach offers a scalable, interpretable path toward safer and more reliable agentic reasoning in tool-using LLMs.

Abstract

Large Language Models (LLMs) increasingly rely on external tools such as search engines to solve complex agentic tasks that require reasoning and external knowledge retrieval. Recently, reinforcement learning with verifiable rewards (RLVR) has proven effective at advancing the capabilities of LLMs by rewarding final answers through outcome rewards. While straightforward to supervise, outcome rewards provide only sparse signals and delayed feedback, which limits their effectiveness on long trajectories. Process rewards address this by evaluating intermediate steps, providing fine-grained supervision and encouraging grounded problem solving. However, step-wise labels are notoriously hard to annotate, especially for non-verifiable processes that lack "golden" answers. Furthermore, step-wise judgment must balance local quality against contribution to the final outcome, since optimizing for a higher process reward does not always lead to a better final outcome. To address these challenges, we introduce Principle Process Reward (PPR), an RL approach that unifies principled step-level assessment and outcome verification. We train a principle-based reward model to improve the transparency and reliability of process evaluation, and further introduce a Reward Normalization (ReNorm) strategy to calibrate outcome and process rewards. Experimental results show that PPR achieves state-of-the-art performance across a wide range of benchmarks, demonstrating its robustness and generalization. Our code and model collection are available at this link.

Paper Structure

This paper contains 22 sections, 11 equations, 5 figures, 9 tables, 1 algorithm.

Figures (5)

  • Figure 2: Overview of PPR Rollout: Given a user query $q$, the policy model interacts with a search engine and produces a multi-step conversation. At the $t$-th step, it generates a reasoning trace $R_{t}$ and, if applicable, a search query $S_{t}$. The retrieved information $\mathrm{info}_{t}$ is appended to the context for subsequent steps. The PRM assigns step-wise credit, while the ORM assesses the final answer $O$.
  • Figure 3: Reward tensor. The PPRM generates step-wise rewards $\hat{r}_{p_i}$ by dynamically selecting relevant principles from the principle set according to the given context. For example, in the first turn PPRM selects #1, #2, and #3, while in the second turn it selects #1, #3, and #4. A rule-based outcome reward $r_{o}$ is computed from the final response. ReNorm $f$ normalizes $\hat{r}_{p_i}$ and $r_{o}$ to obtain the final step reward $r_{p_i}$. All rewards are inserted into the reward tensor at their corresponding positions, with all remaining entries set to zero.
  • Figure 4: Training rewards of Qwen2.5-3B-Instruct on the NQ dataset. (a) PPR vs. Baselines: PPR consistently achieves the highest rewards, while ORM-based (Search-R1) and PRM-based (Qwen3-8B, Skywork-V2) baselines collapse before 300 steps or exhibit severe fluctuations; (b) ReNorm vs. baselines: ReNorm outperforms other normalization methods in both performance and stability; (c) Principle vs. w/o Principle: the comparison demonstrates the effectiveness of principles in PRM design.
  • Figure 5: Analysis on (a) Valid Judge Rate: PPR consistently follows the designed format perfectly, while baselines reach success rates of only 0.7 to 0.9. (b) # of Valid Search: PPR avoids generating redundant or invalid queries and stabilizes over training. (c) Response Length: The average response length converges over training.
  • Figure 6: (a) Qwen2.5-3B vs. Qwen2.5-7B: PPR scales across model sizes. (b) Qwen2.5-3B vs. Qwen2.5-3B-Instruct: The base model initially starts worse but converges to a level similar to the Instruct model. (c) Outcome vs. Process Reward: Process and outcome rewards can align with each other in PPR, indicating that process supervision follows the final judgment and benefits learning.
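
The reward-tensor assembly described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice of placing each step reward at the last token of its step, and in particular `placeholder_renorm` are assumptions made here. The caption only states that ReNorm $f$ combines $\hat{r}_{p_i}$ and $r_{o}$ into $r_{p_i}$; the actual form of $f$ is not given in this summary, so a simple mean stands in for it.

```python
from typing import Callable, List, Sequence


def build_reward_tensor(
    seq_len: int,
    step_positions: Sequence[int],
    raw_step_rewards: Sequence[float],
    outcome_reward: float,
    outcome_position: int,
    renorm: Callable[[float, float], float],
) -> List[float]:
    """Place normalized step rewards and the outcome reward; zeros elsewhere.

    step_positions[i] is the token index that receives the i-th step reward
    (assumed here to be the last token of that step).
    """
    if len(step_positions) != len(raw_step_rewards):
        raise ValueError("one position is required per step reward")

    # All entries start at zero, as described in the caption.
    rewards = [0.0] * seq_len

    # Each raw process reward is calibrated against the outcome reward.
    for pos, r_hat in zip(step_positions, raw_step_rewards):
        rewards[pos] = renorm(r_hat, outcome_reward)

    # The rule-based outcome reward sits at the final-answer position.
    rewards[outcome_position] = outcome_reward
    return rewards


def placeholder_renorm(r_hat: float, r_o: float) -> float:
    """Hypothetical stand-in for ReNorm f; NOT the paper's formula."""
    return 0.5 * (r_hat + r_o)


# Two steps ending at tokens 2 and 5, final answer ending at token 7.
example = build_reward_tensor(
    seq_len=8,
    step_positions=[2, 5],
    raw_step_rewards=[0.5, 1.0],
    outcome_reward=1.0,
    outcome_position=7,
    renorm=placeholder_renorm,
)
print(example)
```

Passing `renorm` as a callable keeps the placement logic separate from the normalization rule, so the placeholder can be swapped for the paper's actual $f$ without touching the rest of the sketch.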
