Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification

Yichi Zhang, Nabeel Seedat, Yinpeng Dong, Peng Cui, Jun Zhu, Mihaela van der Schaar

TL;DR

GLEAN is an agent verification framework with Guideline-grounded Evidence Accumulation that compiles expert-curated protocols into trajectory-informed, well-calibrated correctness signals. It is empirically validated on agentic clinical diagnosis across three diseases from the MIMIC-IV dataset.

Abstract

As LLM-powered agents are increasingly used for high-stakes decision-making, such as clinical diagnosis, it becomes critical to develop reliable verification of their decisions to facilitate trustworthy deployment. Yet existing verifiers usually underperform owing to a lack of domain knowledge and limited calibration. To address this, we establish GLEAN, an agent verification framework with Guideline-grounded Evidence Accumulation that compiles expert-curated protocols into trajectory-informed, well-calibrated correctness signals. GLEAN evaluates step-wise alignment with domain guidelines and aggregates multi-guideline ratings into surrogate features, which are accumulated along the trajectory and calibrated into correctness probabilities using Bayesian logistic regression. Moreover, the estimated uncertainty triggers active verification, which selectively collects additional evidence for uncertain cases by expanding guideline coverage and performing differential checks. We empirically validate GLEAN on agentic clinical diagnosis across three diseases from the MIMIC-IV dataset, where it surpasses the best baseline by 12% in AUROC and reduces the Brier score by 50%, confirming its effectiveness in both discrimination and calibration. In addition, an expert study with clinicians recognizes GLEAN's utility in practice.
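
The accumulate-then-calibrate idea described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the update rule `S_t = beta * S_{t-1} + mean(ratings_t)`, the rating scale in [-1, 1], the independent Gaussian posterior over the calibrator parameters `(a, c)`, and the uncertainty threshold. Only the calibrator form `sigma(a * S + c)` matches the linear-logit model named in the key result.

```python
import math
import random
from statistics import mean, pstdev


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def accumulate_evidence(step_ratings, beta=0.9):
    """Return the running evidence S_t after each trajectory step.

    step_ratings holds one list of per-guideline ratings per step.
    Assumed update (illustrative): S_t = beta * S_{t-1} + mean(ratings_t).
    """
    evidence, running = [], 0.0
    for ratings in step_ratings:
        running = beta * running + mean(ratings)
        evidence.append(running)
    return evidence


def calibrate(evidence, post_mean=(1.5, 0.0), post_std=(0.3, 0.3),
              n_samples=2000, seed=0):
    """Monte Carlo confidence and spread for sigma(a * S + c).

    (a, c) are drawn from an assumed independent Gaussian posterior;
    the mean of the samples is the confidence, the standard deviation
    is used as a simple uncertainty proxy.
    """
    rng = random.Random(seed)
    probs = [
        sigmoid(rng.gauss(post_mean[0], post_std[0]) * evidence
                + rng.gauss(post_mean[1], post_std[1]))
        for _ in range(n_samples)
    ]
    return mean(probs), pstdev(probs)


# Hypothetical ratings: 4 steps x 3 guidelines, positive means aligned.
trajectory = [
    [0.8, 0.6, 0.7],     # history aligns with criteria
    [-0.9, -0.5, -0.4],  # physical examination contradicts guidelines
    [0.5, 0.6, 0.4],     # laboratory results
    [0.9, 0.8, 1.0],     # imaging
]
for step, s in enumerate(accumulate_evidence(trajectory), start=1):
    confidence, uncertainty = calibrate(s)
    needs_active_check = uncertainty > 0.1  # hypothetical threshold
    print(step, round(s, 3), round(confidence, 3), needs_active_check)
```

In this sketch a step whose ratings contradict the guidelines pulls the accumulated evidence down, and a wide posterior spread is what would flag a case for additional evidence collection.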

Paper Structure (24 sections, 1 theorem, 19 equations, 7 figures, 9 tables, 1 algorithm)

Introduction
Related Work
LLM-powered Agents
LLM and Agent Verification
Guideline-Grounded Evidence Accumulation
Verification as Sequential Evidence Accumulation
Guideline-Grounded Surrogate Evidence
Robust and Active Evidence Accumulation
Experiments
Experimental Setup
Main Results
Detailed Analysis
Conclusion
Properties of Guideline-Grounded Signals
Error Bound Analysis
...and 9 more sections

Key Result

Proposition 1.3

Let $\mathcal{F}\subseteq \{f:\mathbb{R}\rightarrow[0,1]\}$ be a class of probabilistic calibrators, which contains the linear-logit model $g^\ast(S)=\sigma(aS+c)$ as introduced in assump:linearity. Given $M$ calibration samples $\{(S^m, Z^{m})\}_{m=1}^M$, define the Brier squared loss risk and its and let $\hat{f}$ be the empirical risk minimizer (ERM) such that $\hat{f}\triangleq\arg\min_{f\in\
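
For orientation, the quantities named in the truncated statement have standard textbook forms. The block below is a sketch under the assumption that the proposition uses the usual population and empirical Brier risks. The symbols $R$ and $\hat{R}_M$ are introduced here for illustration and are not taken from the source, and the remainder of the proposition (the bound itself) is not reconstructed.

```latex
R(f) \triangleq \mathbb{E}\big[(f(S) - Z)^2\big], \qquad
\hat{R}_M(f) \triangleq \frac{1}{M}\sum_{m=1}^{M}\big(f(S^m) - Z^m\big)^2, \qquad
\hat{f} \triangleq \arg\min_{f\in\mathcal{F}} \hat{R}_M(f).
```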

Figures (7)

Figure 1: Guideline-grounded verification in clinical diagnosis. For an agent clinician with access to different examinations (bottom), GLEAN verifies its diagnosis by assessing alignment with clinical guidelines (top) at each step. The example on a patient with acute diverticulitis illustrates how GLEAN accumulates guideline-grounded evidence into calibrated correctness probabilities along the execution (middle). While initial history at the first step aligns well with criteria, physical examination contradicts guidelines on abdominal tenderness and fever, dropping confidence to 0.5. Then, laboratory results recover the confidence, and CT imaging at the last step further confirms the diagnosis with higher confidence.
Figure 2: Properties of guideline-grounded signals. Top: Signals grounded in guidelines significantly discriminate correct from incorrect prefixes, while uninformed signals do not. Bottom: Guideline-grounded signals exhibit strong logit-linearity with correctness, but uninformed ones lack this property.
Figure 3: Pipeline of GLEAN for clinical diagnosis. GLEAN (i) retrieves guidelines for the final diagnosis, (ii) aggregates step-wise scores for alignment with multiple guidelines, (iii) accumulates them into $\beta$-discounted evidence, which is (iv) calibrated to yield confidence and uncertainty. (v) High uncertainty then triggers active verification via guideline expansion and differential checks.
Figure 4: Accuracy of Best-of-N selection using different signals.
Figure 5: Performance with active verification.
...and 2 more figures

Theorems & Definitions (2)

Proposition 1.3
proof
