Structured Uncertainty guided Clarification for LLM Agents

Manan Suri, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi, Dinesh Manocha

TL;DR

This work tackles ambiguity in tool-calling for LLM agents by grounding disambiguation in structured tool schemas and modeling joint tool-argument clarification as a POMDP. It introduces SAGE-Agent, which uses a Bayesian EVPI-based, cost-aware approach to select clarifying questions and update domain constraints, achieving higher task success with fewer questions on ClarifyBench. The paper also presents ClarifyBench, a multi-domain benchmark with realistic user simulation, and demonstrates that structured uncertainty provides effective training signals, significantly boosting When2Call performance via uncertainty-weighted reinforcement learning. Overall, structured uncertainty offers a principled, efficient framework for reliable tool-augmented agents with real-world impact across domains such as document processing, vehicle control, and travel planning.

Abstract

LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures. We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with an Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy. Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39\% while reducing clarification questions by 1.5-2.7$\times$ compared to strong prompting and uncertainty-based baselines. We present ClarifyBench, the first multi-turn tool-augmented disambiguation benchmark with realistic LLM-based user simulation across diverse domains including document editing, vehicle control, and travel booking. Additionally, we demonstrate that structured uncertainty provides effective training signals for reinforcement learning, boosting When2Call accuracy from 36.5\% to 65.2\% (3B model) and 36.7\% to 62.9\% (7B model) through uncertainty-weighted GRPO training. These results establish structured uncertainty as a principled, efficient approach for tool-augmented agents, improving both task success and interaction efficiency in real-world scenarios.
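
To make the idea of cost-penalized EVPI question selection concrete, the following is a minimal sketch, not the paper's implementation. It assumes a discrete belief over candidate tool calls, treats each clarifying question as a function mapping a candidate to the answer the user would give if that candidate were intended, and uses a scalar per-question cost weighted by `lam`. The names `evpi`, `select_question`, `belief`, and `questions`, and all the numbers, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def evpi(belief, answer_of):
    """Expected gain in the probability of picking the right candidate.

    belief: dict mapping candidate tool call -> probability (sums to 1).
    answer_of: function mapping a candidate to the user's answer under
               the assumption that this candidate is the intended one.
    """
    prior_best = max(belief.values())
    probs_by_answer = defaultdict(list)
    for candidate, p in belief.items():
        probs_by_answer[answer_of(candidate)].append(p)
    # E_a[max_c P(c | a)] = sum_a P(a) * max_c P(c) / P(a)
    #                     = sum over answers of the largest prior mass.
    posterior_best = sum(max(ps) for ps in probs_by_answer.values())
    return posterior_best - prior_best


def select_question(belief, questions, lam=0.5):
    """Pick the question maximizing EVPI - lam * cost, or None if none helps.

    questions: dict mapping name -> (answer_of, cost). The cost is an
    assumed scalar; it could be raised for aspects already asked about.
    """
    scores = {
        name: evpi(belief, answer_of) - lam * cost
        for name, (answer_of, cost) in questions.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] <= 0:
        return None, 0.0
    return best, scores[best]


# Hypothetical candidates: (tool, city, day)
belief = {
    ("book_flight", "Paris", "Mon"): 0.4,
    ("book_flight", "Paris", "Tue"): 0.3,
    ("book_flight", "Rome", "Mon"): 0.2,
    ("book_hotel", "Rome", "Mon"): 0.1,
}
questions = {
    "tool": (lambda c: c[0], 0.1),
    "city": (lambda c: c[1], 0.1),
    "day": (lambda c: c[2], 0.1),
}
best_question, best_score = select_question(belief, questions)
```

In this toy setting, asking about the day separates the two most probable candidates, so it scores highest. Once the belief collapses onto a single candidate, every question has zero EVPI and the sketch returns `None`, which corresponds to executing the tool call without further clarification.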

Paper Structure

This paper contains 61 sections, 20 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: Disambiguation strategies purely grounded in linguistics fail to effectively leverage domain schemas, leading to issues like unnecessary clarifications and assumption of inappropriate default arguments. In contrast, grounding the disambiguation in the structured space of parameter domains mitigates these problems.
  • Figure 2: ClarifyBench enables comprehensive evaluation of agent clarification strategies by simulating normal, ambiguous, and infeasible user queries across five domains. A dynamic user simulator conducts multi-turn interactions with tool-equipped LLM agents, with evaluation based on alignment with ground truth agent tool calls.
  • Figure 3: SAGE-Agent: (➊) Given a user query, an LLM reasons and generates potential tool calls with possibly uncertain parameters. These tool calls undergo (➋) structured uncertainty quantification to determine if clarification is needed. When uncertainty exists, the agent uses an LLM to produce (➌) candidate clarifying questions and scores them using (➍) a cost-penalized Expected Value of Perfect Information (EVPI) metric. The tool-parameter domain interpretation is updated based on the user's response to the clarifying question (➎), and given no further uncertainty, the best tool call is executed (➏).
  • Figure 4: Resource consumption across methods for GPT-4o and Qwen2.5-14B.
  • Figure 5: Effect of $\lambda$ on performance metrics across ClarifyBench splits. Increasing $\lambda$ from 0 to 0.5 reduces #Q by 18-27% while maintaining stable Coverage, TMR, and PMR ($<3\%$ deviation).
  • ...and 2 more figures