E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

Shuvom Sadhuka, Drew Prinster, Clara Fannjiang, Gabriele Scalia, Aviv Regev, Hanchen Wang

TL;DR

E-valuator tackles the lack of false-alarm guarantees in verifier-based agent evaluation by framing trajectory success as a sequential hypothesis test. It learns density ratios from calibration data to form an optimal, online decision statistic M_t and applies Ville's inequality to achieve anytime-valid false-alarm control, with a PAC variant to account for estimation error. Across multiple datasets and agent-verifier configurations, it shows improved false-alarm control and increased power, enabling early termination of failing trajectories and token savings, and extends to non-LLM settings such as chess engines. The method is model-agnostic, lightweight, and complements improvements in verifiers, with code and data released for practical deployment.

Abstract

Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent's trajectory. Although these heuristic scores can be informative, there are no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing successful trajectories (that is, a sequence of actions that will lead to a correct response to the user's prompt) and unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop a sequential hypothesis test that remains statistically valid at every step of an agent's trajectory, enabling online monitoring of agents over arbitrarily long sequences of actions. Empirically, we demonstrate that e-valuator provides greater statistical power and better false alarm rate control than other strategies across six datasets and three agents. We additionally show that e-valuator can be used to quickly terminate problematic trajectories and save tokens. Together, e-valuator provides a lightweight, model-agnostic framework that converts verifier heuristics into decision rules with statistical guarantees, enabling the deployment of more reliable agentic systems.

Paper Structure

This paper contains 27 sections, 3 theorems, 20 equations, 7 figures, 1 table, 3 algorithms.

Key Result

Proposition 1

For any fixed $\alpha \in (0, 1)$, the algorithm, run with the density ratio process $M_t = p_0(\mathbf{S}_{[1:t]})/ p_1(\mathbf{S}_{[1:t]})$ and the decision threshold $c_\alpha = 1 / \alpha$, achieves anytime-valid control of the false alarm rate.
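
As a rough illustration of Proposition 1, the sketch below simulates the $1/\alpha$ rule on toy data. Everything specific here is an assumption made for the example and not taken from the paper: verifier scores are i.i.d. unit-variance Gaussians with mean 1 for successful trajectories ($p_1$) and mean 0 for unsuccessful ones ($p_0$), the density ratio is known exactly rather than learned, and the helper names are hypothetical.

```python
import numpy as np

def log_density_ratio_increment(scores, mu_fail=0.0, mu_succ=1.0):
    """Per-step log p_0(s) - log p_1(s) for unit-variance Gaussians (toy assumption)."""
    return 0.5 * ((scores - mu_succ) ** 2 - (scores - mu_fail) ** 2)

def rejected_by_ville(scores, alpha):
    """Flag each trajectory whose M_t reaches 1/alpha at any step.

    scores has shape (n_trajectories, n_steps); with i.i.d. steps,
    log M_t is the cumulative sum of the per-step log ratios.
    """
    log_m = np.cumsum(log_density_ratio_increment(scores), axis=1)
    return (log_m >= np.log(1.0 / alpha)).any(axis=1)

rng = np.random.default_rng(0)
alpha, n, T = 0.1, 20_000, 50

# Successful trajectories follow p_1, unsuccessful ones follow p_0.
successful = rng.normal(1.0, 1.0, size=(n, T))
unsuccessful = rng.normal(0.0, 1.0, size=(n, T))

false_alarm_rate = rejected_by_ville(successful, alpha).mean()
power = rejected_by_ville(unsuccessful, alpha).mean()

print(f"false alarm rate: {false_alarm_rate:.3f} (target <= {alpha})")
print(f"power: {power:.3f}")
```

Under these assumptions the fraction of successful trajectories that are ever rejected stays below $\alpha$ however many steps are monitored, which is the anytime-valid property the proposition states. With learned density ratios $\hat{M}_t$ the guarantee would additionally depend on estimation error, which is the gap the PAC variant is described as addressing.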

Figures (7)

  • Figure 1: Overview of e-valuator. E-valuator works in three steps. First, we collect a small calibration set of trajectories, verifier scores, and labels. Second, we learn the density ratios $\hat{M}_t$ at each timestep $t$. Third, we find the decision threshold that controls the false alarm rate using either Ville's inequality or a quantile-based empirical approach. For a given threshold, unsuccessful trajectories (red) should be rejected at a higher rate than successful ones (green).
  • Figure 2: E-valuator controls the false alarm rate and maximizes power better than alternative methods. Violations of the false alarm rate control are marked with an X. Both versions of e-valuator empirically control the false alarm rate (type I error) for different choices of $\alpha$ across all datasets. As expected, the $1/\alpha$ threshold is more conservative than the PAC threshold, although both control the false alarm rate. E-valuator also provides better power than competing methods. The calibrated and raw verifiers occasionally provide comparable power, at the cost of inflating the false alarm rate. All plots show the 95% CI over 50 random splits of each dataset.
  • Figure 3: E-valuator recovers a larger fraction of baseline accuracy with fewer tokens. We compare e-valuator to thresholding the verifier scores on the MATH and MMLU-Pro datasets. Thresholding the verifier scores terminates unsuccessful trajectories relatively late, so it recovers accuracy less efficiently per token. $\times$ indicates that the empirical false alarm rate was greater than the desired level.
  • Figure 4: Chess. E-valuator controls the false alarm rate and increases power for chess verifiers.
  • Figure 5: GSM8k and MATH results. The false alarm rate is empirically controlled for both variants of e-valuator. Additionally, e-valuator achieves optimal power among methods that are able to control the false alarm rate.
  • ...and 2 more figures
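
The caption of Figure 1 mentions a quantile-based empirical threshold as an alternative to Ville's inequality, and Figure 2 notes that the $1/\alpha$ threshold is the more conservative of the two. The source does not spell out the quantile rule, so the sketch below is only one plausible reading, stated as an assumption: take a conservative empirical $(1-\alpha)$ quantile of the running maximum of $M_t$ over successful calibration trajectories. The toy Gaussian scores, the known density ratio, and the function names are likewise hypothetical.

```python
import numpy as np

def log_m_paths(scores, mu_fail=0.0, mu_succ=1.0):
    """log M_t paths for i.i.d. unit-variance Gaussian scores (toy assumption)."""
    inc = 0.5 * ((scores - mu_succ) ** 2 - (scores - mu_fail) ** 2)
    return np.cumsum(inc, axis=1)

def quantile_log_threshold(log_m_success, alpha):
    """Conservative (1 - alpha) quantile of the running max of log M_t.

    Uses the ceil((n + 1)(1 - alpha))-th order statistic; this is an assumed
    rule for illustration, not the paper's confirmed procedure.
    """
    running_max = np.sort(log_m_success.max(axis=1))
    n = running_max.size
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    return np.inf if k > n else running_max[k - 1]

rng = np.random.default_rng(1)
alpha, T = 0.1, 50

# Calibration and held-out sets of successful trajectories (scores drawn from p_1).
calib = log_m_paths(rng.normal(1.0, 1.0, size=(2_000, T)))
test = log_m_paths(rng.normal(1.0, 1.0, size=(20_000, T)))

threshold = quantile_log_threshold(calib, alpha)
ville_threshold = np.log(1.0 / alpha)

fa_quantile = (test.max(axis=1) > threshold).mean()
fa_ville = (test.max(axis=1) >= ville_threshold).mean()

print(f"quantile threshold (log scale): {threshold:.2f}, Ville: {ville_threshold:.2f}")
print(f"false alarm rate, quantile: {fa_quantile:.3f}; Ville: {fa_ville:.3f}")
```

In this toy setting the calibrated threshold sits below $\log(1/\alpha)$, so it rejects more often while keeping the held-out false alarm rate near $\alpha$. Unlike the Ville bound, this kind of guarantee depends on the calibration sample and on the trajectory lengths seen during calibration.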

Theorems & Definitions (9)

  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • proof
  • proof
  • proof