Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents

Jiangming Shu; Yuxiang Zhang; Ye Ma; Xueyuan Lin; Jitao Sang

TL;DR

This work proposes EvalAct (Evaluate-as-Action), which converts implicit retrieval quality assessment into an explicit action and enforces a coupled Search-to-Evaluate protocol so that each retrieval is immediately followed by a structured evaluation score, yielding process signals aligned with the interaction trajectory.

Abstract

Retrieval-augmented agents can query external evidence, yet their reliability in multi-step reasoning remains limited: noisy retrieval may derail multi-hop question answering, while outcome-only reinforcement learning provides credit signals that are too coarse to optimize intermediate steps. We propose EvalAct (Evaluate-as-Action), which converts implicit retrieval quality assessment into an explicit action and enforces a coupled Search-to-Evaluate protocol so that each retrieval is immediately followed by a structured evaluation score, yielding process signals aligned with the interaction trajectory. To leverage these signals, we introduce Process-Calibrated Advantage Rescaling (PCAR), a GRPO-based optimization method that rescales advantages at the segment level according to evaluation scores, emphasizing reliable segments while updating uncertain ones conservatively. Experiments on seven open-domain QA benchmarks show that EvalAct achieves the best average accuracy, with the largest gains on multi-hop tasks, and ablations verify that the explicit evaluation loop drives the primary improvements while PCAR provides consistent additional benefits.
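
The abstract describes PCAR only at a high level, so the following is a minimal sketch of the general idea rather than the paper's formula. It assumes the standard GRPO group-normalized advantage and then applies a hypothetical linear weight per segment; the function names, the 0 to 10 score range, and the `intensity` parameter are all illustrative assumptions.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps) over one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rescale_by_segment(advantage, scores, intensity=0.5, max_score=10.0):
    """Scale one trajectory's advantage separately for each segment.

    ASSUMPTION: the weight is linear in the self-evaluation score and lies in
    [1 - intensity, 1 + intensity]. The paper's actual rescaling rule may differ.
    """
    rescaled = []
    for z in scores:
        confidence = min(max(z / max_score, 0.0), 1.0)  # clamp to [0, 1]
        weight = 1.0 + intensity * (2.0 * confidence - 1.0)
        rescaled.append(advantage * weight)
    return rescaled


# Two rollouts in a group: one correct (reward 1), one incorrect (reward 0).
advantages = grpo_advantages([1.0, 0.0])
# The first rollout has three Search-to-Evaluate segments scored 10, 0, and 5.
segment_advantages = rescale_by_segment(advantages[0], [10, 0, 5])
```

Under these assumptions, a high-scoring segment receives a larger update than the trajectory-level advantage alone would give, and a low-scoring segment a smaller one. The sketch treats positive and negative advantages identically, which the actual method may not.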

Paper Structure (40 sections, 11 equations, 3 figures, 4 tables, 1 algorithm)

Introduction
Methodology
Problem Formulation
Action space.
Observations.
EvalAct: Evaluate-as-Action
Inference-time control without oracle signals.
Reinforcement Learning with PCAR
Gated outcome reward.
GRPO.
PCAR: segment-wise advantage rescaling.
Experiments
Experimental Setup
Datasets.
Baselines.
...and 25 more sections

Figures (3)

Figure 1: Overview of EvalAct with PCAR. The agent follows a coupled Search$\rightarrow$Evaluate protocol, producing segment-wise self-evaluation scores $\{z_{i,k}\}$ that PCAR uses to rescale GRPO advantages.
Figure 2: Effect of SFT on Format Alignment. The Base model exhibits high tool parsing failure rates.
Figure 3: Training curves and ablation overview. (a) Training curves of EvalAct with 3B/7B backbones. (b) Model ablation across training variants. (c) Method ablation on removing the evaluation loop or PCAR. (d) Sensitivity to PCAR rescaling intensity.
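
The coupled Search-to-Evaluate protocol in Figure 1 can be illustrated with a small trajectory check. This is a sketch under stated assumptions: the action names `"search"`, `"evaluate"`, and `"answer"`, the `(action, payload)` tuple layout, and the helper name `segment_scores` are hypothetical and not taken from the paper.

```python
def segment_scores(trajectory):
    """Collect one evaluation score per Search-to-Evaluate segment.

    `trajectory` is a list of (action, payload) tuples (assumed layout).
    Raises ValueError if a search is not immediately followed by an evaluate.
    Actions other than "search" are skipped.
    """
    scores = []
    i = 0
    while i < len(trajectory):
        action, _ = trajectory[i]
        if action == "search":
            has_next = i + 1 < len(trajectory)
            if not has_next or trajectory[i + 1][0] != "evaluate":
                raise ValueError(f"search at step {i} is not followed by evaluate")
            scores.append(trajectory[i + 1][1])  # payload of evaluate = score
            i += 2
        else:
            i += 1
    return scores


trajectory = [
    ("search", "first query"),
    ("evaluate", 7),
    ("search", "refined query"),
    ("evaluate", 3),
    ("answer", "final answer"),
]
scores = segment_scores(trajectory)  # one score per segment, in order
```

The returned list plays the role of the per-segment scores that Figure 1 denotes $\{z_{i,k}\}$ for a single trajectory.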
