TRACER: Trajectory Risk Aggregation for Critical Episodes in Agentic Reasoning

Sina Tayebati; Divake Kumar; Nastaran Darabi; Davide Ettori; Ranganath Krishnan; Amit Ranjan Trivedi

TL;DR

TRACER tackles uncertainty estimation in long, multi-turn tool-using dialogues by reframing failures as sparse, trajectory-level events rather than token-level mistakes. It combines Stage I content-aware surprisal, Stage II situational-awareness indicators for repetition and coherence, and Stage III MAX-based tail-risk aggregation to produce a robust trajectory risk score with theoretical guarantees. Applied to the $\tau^2$-bench dual-control environment, TRACER improves failure prediction (AUROC) and selective execution (AUARC) and provides earlier warning signals than token-based baselines. The approach enables safer, more reliable abstention and intervention in real-world agentic systems that involve humans in the loop and tool use.

Abstract

Estimating uncertainty for AI agents in real-world multi-turn tool-using interaction with humans is difficult because failures are often triggered by sparse critical episodes (e.g., looping, incoherent tool use, or user-agent miscoordination) even when local generation appears confident. Existing uncertainty proxies focus on single-shot text generation and therefore miss these trajectory-level breakdown signals. We introduce TRACER, a trajectory-level uncertainty metric for dual-control Tool-Agent-User interaction. TRACER combines content-aware surprisal with situational-awareness signals, namely semantic and lexical repetition and tool-grounded coherence gaps, and aggregates them using a tail-focused risk functional with a MAX-composite step risk to surface decisive anomalies. We evaluate TRACER on $\tau^2$-bench on two tasks: task-failure prediction and selective task execution. On these tasks, TRACER improves AUROC by up to 37.1% and AUARC by up to 55% over baselines, enabling earlier and more accurate detection of uncertainty in complex conversational tool-use settings. Our code and benchmark are available at https://github.com/sinatayebati/agent-tracer.

Paper Structure (33 sections, 7 theorems, 54 equations, 2 figures, 4 tables)

Introduction
Related Work
Uncertainty estimation in language models.
Calibration, selective prediction, and risk measures.
Situation awareness and failure modes in agentic systems.
Tool-using agents and multi-turn benchmarks.
Contributions and novelty.
Methodology
Dual-Control Trajectory Model
Stage I: Content-Aware Normalized Surprisal
Stage II: Situational Awareness Indicators
Hybrid Local Repetition Functional
Inference-Gap (Coherence) Functionals
Stage III: MAX-Composite Step Risk and Tail Aggregation
MAX-composite step risk
...and 18 more sections

Key Result

Theorem A.1

Assume $|\mathcal{I}_t|>0$ and $Q_{t,j}\ll P_{t,j}$ for all $j\in\mathcal{I}_t$. Then $H_t^{\mathrm{cont}}(Q,P)$ decomposes into an uncertainty term and a mismatch term, where $H(\cdot)$ denotes Shannon entropy and $\mathrm{KL}(\cdot\|\cdot)$ denotes Kullback-Leibler divergence. Moreover, the sample statistic $U_t$ is an unbiased estimator of $H_t^{\mathrm{cont}}(Q,P)$ conditional on $(W_{t,<j},\mathcal{C}_t)$.
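
The display equation of Theorem A.1 is not reproduced in this summary. One reading consistent with the stated assumptions, offered as a sketch rather than the paper's exact statement, treats $H_t^{\mathrm{cont}}(Q,P)$ as the average cross-entropy over the content positions $\mathcal{I}_t$. The averaging over $\mathcal{I}_t$ and the per-position notation are assumptions:

```latex
H_t^{\mathrm{cont}}(Q,P)
  = \frac{1}{|\mathcal{I}_t|}\sum_{j\in\mathcal{I}_t}
    \Bigl[\, H(Q_{t,j}) + \mathrm{KL}\bigl(Q_{t,j}\,\|\,P_{t,j}\bigr) \Bigr]
```

Under this reading, each summand is the standard identity $-\sum_w Q(w)\log P(w) = H(Q) + \mathrm{KL}(Q\|P)$. The entropy term plays the role of uncertainty, the divergence term plays the role of mismatch, and the condition $Q_{t,j}\ll P_{t,j}$ keeps the divergence finite.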

Figures (2)

Figure 1: Overview of TRACER for trajectory-level uncertainty estimation in agentic reasoning. Left: A multi-turn Agent--User interaction with tool calls and delayed failure resolution. Right: At each agent step $t$, content-aware surprisal $U_t$, agent repetition $D_a(t)$, action--observation mismatch $D_o^{A}(t)$, and user--agent coordination gap $D_o^{U}(t)$ are computed and combined via a MAX-composite step risk, $r_t = \max(U_t, \alpha D_a(t), \beta D_o^{A}(t), \gamma D_o^{U}(t))$. Trajectory risk is obtained through tail-focused aggregation (top-$K$ mean and $\ell_\infty$ norm).
Figure 2: Early-warning detection curves showing the proportion of failed tasks detected by each metric as a function of trajectory progress. TRACER consistently signals failures earlier, especially within the first 20% (highlighted) of execution.
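
The step risk and tail aggregation described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the unit weights for $\alpha$, $\beta$, $\gamma$, the choice of $K$, and the example signal values are all assumptions. The caption does not say how the top-$K$ mean and the $\ell_\infty$ norm are combined, so the sketch reports them separately.

```python
from typing import Sequence


def step_risk(u: float, d_a: float, d_o_agent: float, d_o_user: float,
              alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """MAX-composite step risk: r_t = max(U_t, alpha*D_a, beta*D_o^A, gamma*D_o^U).

    The default weights of 1.0 are placeholders, not values from the paper.
    """
    return max(u, alpha * d_a, beta * d_o_agent, gamma * d_o_user)


def top_k_mean(risks: Sequence[float], k: int) -> float:
    """Mean of the k largest step risks; k is clipped to [1, len(risks)]."""
    if not risks:
        raise ValueError("trajectory has no agent steps")
    k = max(1, min(k, len(risks)))
    return sum(sorted(risks, reverse=True)[:k]) / k


def linf_risk(risks: Sequence[float]) -> float:
    """l_inf aggregation: the single largest step risk in the trajectory."""
    if not risks:
        raise ValueError("trajectory has no agent steps")
    return max(risks)


# Hypothetical per-step signals (U_t, D_a, D_o^A, D_o^U) for a four-step trajectory.
signals = [
    (0.20, 0.10, 0.00, 0.10),
    (0.30, 0.90, 0.10, 0.20),  # repetition spike at step 2
    (0.25, 0.20, 0.70, 0.10),  # action-observation mismatch at step 3
    (0.10, 0.10, 0.10, 0.10),
]
risks = [step_risk(*s) for s in signals]

print("step risks:", risks)
print("top-2 mean:", top_k_mean(risks, k=2))
print("l_inf     :", linf_risk(risks))
```

In this toy trajectory most steps look benign, yet both aggregates are driven by the two anomalous steps. That is the behavior a tail-focused score is meant to have: a plain average over all steps would dilute the same spikes.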

Theorems & Definitions (14)

Theorem A.1: Decomposition to uncertainty and mismatch
proof
Theorem A.2: Coherence of $\rho_{k,w}$
proof
Lemma A.3: Max-aggregation is nonexpansive
proof
Theorem A.4: $\ell_\infty$-Lipschitz stability
proof
Lemma A.5: Union bound for breakdown probability
proof
...and 4 more
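
For orientation, the nonexpansiveness and union-bound results named above have well-known generic forms. The paper's exact statements, constants, and notation may differ; the symbols $a_i$, $b_i$, and $E_t$ below are illustrative assumptions, not taken from the source:

```latex
% Max-aggregation is nonexpansive in the sup norm (generic form)
\Bigl|\max_{i} a_i - \max_{i} b_i\Bigr| \;\le\; \max_{i}\,\lvert a_i - b_i\rvert

% Union bound over per-step breakdown events E_1,\dots,E_T (generic form)
\Pr\Bigl(\,\bigcup_{t=1}^{T} E_t\Bigr) \;\le\; \sum_{t=1}^{T} \Pr(E_t)
```

The first inequality says that perturbing each component signal by at most $\varepsilon$ changes a max-based score by at most $\varepsilon$. The second bounds the probability that any step breaks down by the sum of the per-step probabilities.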
