SAUP: Situation Awareness Uncertainty Propagation on LLM Agent
Qiwei Zhao, Xujiang Zhao, Yanchi Liu, Wei Cheng, Yiyou Sun, Mika Oishi, Takao Osaki, Katsushi Matsuda, Huaxiu Yao, Haifeng Chen
TL;DR
This work tackles uncertainty estimation in multi-step LLM-based agents, where traditional single-step metrics fail to capture cumulative errors and environment interactions. It introduces SAUP, a framework that propagates uncertainty across each reasoning step and combines the step-wise uncertainties using situation-aware weights through root-mean-square (RMS) aggregation: $U_{agent} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (W_i U_i)^2}$, where $U_i$ is the uncertainty of step $i$, $W_i$ is its situational weight, and $N$ is the number of steps. SAUP uses one-step estimators as a backbone and augments them with surrogates (notably SAUP-HMMD, a CHMM-based approach) that infer the agent’s situational context via distances derived from RoBERTa and Baum-Welch training. Empirical results on HotpotQA, StrategyQA, and MMLU show SAUP achieving higher AUROC than prior methods, with gains of up to 20%, demonstrating improved reliability for complex, high-stakes, multi-step decision-making in LLM-driven agents.
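The RMS aggregation in the formula above can be illustrated with a minimal Python sketch. The function name `aggregate_uncertainty` and the sample values are hypothetical and chosen only for illustration. The sketch assumes the per-step uncertainties $U_i$ and situational weights $W_i$ have already been computed; it does not show how SAUP derives them.

```python
import math
from typing import Sequence


def aggregate_uncertainty(
    step_uncertainties: Sequence[float],
    situational_weights: Sequence[float],
) -> float:
    """Weighted RMS: sqrt((1/N) * sum((W_i * U_i)^2))."""
    if len(step_uncertainties) != len(situational_weights):
        raise ValueError("need exactly one weight per step uncertainty")
    if not step_uncertainties:
        raise ValueError("need at least one step")
    n = len(step_uncertainties)
    # Square each weighted step uncertainty, average over steps, take the root.
    squared_sum = sum(
        (w * u) ** 2 for w, u in zip(situational_weights, step_uncertainties)
    )
    return math.sqrt(squared_sum / n)


# Illustrative values only (not taken from the paper).
step_uncertainties = [0.2, 0.5, 0.9]
uniform_weights = [1.0, 1.0, 1.0]      # reduces to a plain RMS of the U_i
late_step_weights = [0.5, 1.0, 1.5]    # emphasises the final step

print(aggregate_uncertainty(step_uncertainties, uniform_weights))
print(aggregate_uncertainty(step_uncertainties, late_step_weights))
```

With all weights equal to 1 the expression reduces to the ordinary RMS of the step uncertainties. Because each term is squared before averaging, a single step with a large weighted uncertainty raises the aggregate more than it would under a plain mean.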
Abstract
Large language models (LLMs) integrated into multi-step agent systems enable complex decision-making processes across various applications. However, their outputs often lack reliability, making uncertainty estimation crucial. Existing uncertainty estimation methods primarily focus on final-step outputs and therefore fail to account for cumulative uncertainty over the multi-step decision-making process and the dynamic interactions between agents and their environments. To address these limitations, we propose SAUP (Situation Awareness Uncertainty Propagation), a novel framework that propagates uncertainty through each step of an LLM-based agent's reasoning process. SAUP incorporates situational awareness by assigning situational weights to each step's uncertainty during the propagation. Our method, compatible with various one-step uncertainty estimation techniques, provides a comprehensive and accurate uncertainty measure. Extensive experiments on benchmark datasets demonstrate that SAUP significantly outperforms existing state-of-the-art methods, achieving an improvement of up to 20% in AUROC.
