E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing
Shuvom Sadhuka, Drew Prinster, Clara Fannjiang, Gabriele Scalia, Aviv Regev, Hanchen Wang
TL;DR
E-valuator tackles the lack of false-alarm guarantees in verifier-based agent evaluation by framing trajectory success as a sequential hypothesis test. It learns density ratios from calibration data to form an optimal, online decision statistic M_t and applies Ville's inequality to achieve anytime-valid false-alarm control, with a PAC variant to account for estimation error. Across multiple datasets and agent-verifier configurations, it shows improved false-alarm control and increased power, enabling early termination of failing trajectories and token savings, and extends to non-LLM settings such as chess engines. The method is model-agnostic, lightweight, and complements improvements in verifiers, with code and data released for practical deployment.
Abstract
Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent's trajectory. Although these heuristic scores can be informative, they come with no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing successful trajectories (that is, sequences of actions that will lead to a correct response to the user's prompt) from unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop a sequential hypothesis test that remains statistically valid at every step of an agent's trajectory, enabling online monitoring of agents over arbitrarily long sequences of actions. Empirically, we demonstrate that e-valuator provides greater statistical power and better false alarm rate control than other strategies across six datasets and three agents. We additionally show that e-valuator can be used to quickly terminate problematic trajectories and save tokens. Overall, e-valuator provides a lightweight, model-agnostic framework that converts verifier heuristics into decision rules with statistical guarantees, enabling the deployment of more reliable agentic systems.
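To make the sequential test concrete, the following is a minimal sketch of the monitoring idea, not the released e-valuator implementation. It assumes the null hypothesis is "the trajectory is successful" (so a false alarm means flagging a successful trajectory), that per-step verifier scores are independent, and that their densities under success and failure are known Gaussians. The method itself learns density ratios from calibration data; the Gaussian parameters and the names `gaussian_pdf`, `monitor`, and `false_alarm_rate` are hypothetical and chosen only for illustration. The statistic M_t is the running product of failure-to-success density ratios, and by Ville's inequality the probability that it ever reaches 1/alpha on a successful trajectory is at most alpha.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution (stand-in for a learned score density)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def monitor(scores, alpha=0.05, success=(0.7, 0.15), failure=(0.4, 0.15)):
    """Track M_t over verifier scores and stop once M_t >= 1/alpha.

    Returns (stop_step, history): stop_step is the 1-indexed step at which
    the trajectory is flagged as failing, or None if it is never flagged.
    The (mean, std) pairs are assumed values, not estimates from real data.
    """
    threshold = 1.0 / alpha
    m = 1.0  # M_0 = 1
    history = []
    for t, score in enumerate(scores, start=1):
        # Multiply in the density ratio: failure density over success density.
        m *= gaussian_pdf(score, *failure) / gaussian_pdf(score, *success)
        history.append(m)
        if m >= threshold:
            return t, history  # terminate the trajectory early
    return None, history


def false_alarm_rate(n_traj=2000, n_steps=20, alpha=0.05, seed=0):
    """Fraction of simulated successful trajectories that get flagged."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_traj):
        scores = [rng.gauss(0.7, 0.15) for _ in range(n_steps)]
        stop, _history = monitor(scores, alpha=alpha)
        if stop is not None:
            flagged += 1
    return flagged / n_traj


if __name__ == "__main__":
    print("flagged at step:", monitor([0.4, 0.4, 0.4])[0])
    print("empirical false alarm rate:", false_alarm_rate())
```

Under these assumptions, the simulated false alarm rate should stay at or below alpha up to Monte Carlo noise, while trajectories whose scores look like failures are flagged within a few steps. With estimated rather than known density ratios this guarantee no longer holds exactly, which is the gap the PAC variant mentioned in the TL;DR is meant to address.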
