Tools in the Loop: Quantifying Uncertainty of LLM Question Answering Systems That Use Tools

Panagiotis Lymperopoulos, Vasanth Sarathy

TL;DR

This paper presents a probabilistic framework for uncertainty quantification in tool-calling LLMs that jointly models the uncertainty of the LLM and of the external tool in a white-box setting. It extends entropy-based measures, including predictive and semantic entropy, to tool-augmented systems and introduces an efficient Strong Tool Approximation (STA) that makes the computation practical. The method is validated on two synthetic QA datasets that require tool calls (IRIS QA and Diabetes QA) and in a retrieval-augmented generation (RAG) scenario, where STA-based metrics often predict whether the system’s answers are trustworthy better than baseline entropy over final outputs. The work highlights the framework’s modularity, which allows integration with various uncertainty estimators and tool architectures, and discusses broader implications for the safe and reliable deployment of tool-using LLM agents in high-stakes domains.

Abstract

Modern Large Language Models (LLMs) often require external tools, such as machine learning classifiers or knowledge retrieval systems, to provide accurate answers in domains where their pre-trained knowledge is insufficient. This integration of LLMs with external tools expands their utility but also introduces a critical challenge: determining the trustworthiness of responses generated by the combined system. In high-stakes applications, such as medical decision-making, it is essential to assess the uncertainty of both the LLM's generated text and the tool's output to ensure the reliability of the final response. However, existing uncertainty quantification methods do not account for the tool-calling scenario, where both the LLM and external tool contribute to the overall system's uncertainty. In this work, we present a novel framework for modeling tool-calling LLMs that quantifies uncertainty by jointly considering the predictive uncertainty of the LLM and the external tool. We extend previous methods for uncertainty quantification over token sequences to this setting and propose efficient approximations that make uncertainty computation practical for real-world applications. We evaluate our framework on two new synthetic QA datasets, derived from well-known machine learning datasets, which require tool-calling for accurate answers. Additionally, we apply our method to retrieval-augmented generation (RAG) systems and conduct a proof-of-concept experiment demonstrating the effectiveness of our uncertainty metrics in scenarios where external information retrieval is needed. Our results show that the framework is effective in enhancing trust in LLM-based systems, especially in cases where the LLM's internal knowledge is insufficient and external tools are required.

Paper Structure

This paper contains 21 sections, 10 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Illustration of our model of the LLM+tool system. The system receives an input prompt $x$, such as a question that requires a tool (e.g. a classifier) to answer. The LLM produces a tool call $a$ which acts as input to the tool. The tool produces output $z$, which in turn is mapped to a token sequence and provided to the LLM, alongside the original prompt. Finally, the LLM produces the final answer $y$. Yellow indicates the features that the LLM needs to extract for the tool call. Green indicates the final question the system needs to answer and red indicates the final answer provided by the combined system. Within this framework we can quantify the uncertainty of the final answer while taking into account the uncertainty of the classifier.
  • Figure 2: Illustration of our framework applied to RAG. The system receives an input question $x$, shown in green, that requires additional documents to answer. In our framework, we describe the document retriever as a categorical distribution over documents. The system samples relevant documents $z$ from that distribution, which are added to the LLM context. Finally, the LLM produces the final answer $y$. Yellow indicates the most relevant passage. The bar plot shows that in this example the retrieval distribution has low entropy, so uncertainty in the retrieval is low. Red indicates the final answer provided by the combined system. Within this framework we can quantify the uncertainty of the overall system answer, taking into account the uncertainty of the retrieval system.
  • Figure 3: Samples from the IRIS QA and the Diabetes QA datasets. Yellow indicates the portion of the question from which the LLM needs to extract features for the tool call. Green indicates the question that the system needs to answer based on the prompt and the tool response. Red indicates the answer.
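
The pipeline described in Figures 1 and 2 (prompt $x$, tool output $z$, final answer $y$) can be illustrated with a small discrete toy example. The sketch below is not the paper's method or its Strong Tool Approximation. It only shows, under simplifying assumptions, how uncertainty in a tool's categorical output can propagate to the final answer by marginalizing over $z$. All names (`entropy`, `marginal_answer_distribution`) and all probability values are hypothetical and chosen for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def marginal_answer_distribution(tool_dist, answer_given_tool):
    """Compute p(y | x) = sum_z p(z | a) * p(y | x, z) for discrete z and y."""
    marginal = {}
    for z, p_z in tool_dist.items():
        for y, p_y_given_z in answer_given_tool[z].items():
            marginal[y] = marginal.get(y, 0.0) + p_z * p_y_given_z
    return marginal


# Hypothetical tool output distribution p(z | a), e.g. classifier
# probabilities or a categorical retrieval distribution over documents.
tool_dist = {"setosa": 0.7, "versicolor": 0.3}

# Hypothetical LLM answer distribution p(y | x, z) for each tool output z.
answer_given_tool = {
    "setosa": {"yes": 0.9, "no": 0.1},
    "versicolor": {"yes": 0.2, "no": 0.8},
}

p_y = marginal_answer_distribution(tool_dist, answer_given_tool)

# Entropy of the tool output alone (cf. the retrieval bar plot in Figure 2).
tool_entropy = entropy(tool_dist.values())

# Entropy of the final answer with the tool's uncertainty taken into account.
answer_entropy = entropy(p_y.values())

# Entropy of the answer if only the most likely tool output were considered.
top_z = max(tool_dist, key=tool_dist.get)
answer_entropy_top_z = entropy(answer_given_tool[top_z].values())

print(f"tool entropy:              {tool_entropy:.3f} nats")
print(f"answer entropy (marginal): {answer_entropy:.3f} nats")
print(f"answer entropy (top z):    {answer_entropy_top_z:.3f} nats")
```

In this toy setting, conditioning only on the most likely tool output gives a lower answer entropy than the marginal, which is the kind of understatement that accounting for tool uncertainty is meant to avoid. With a real LLM, the distributions over $y$ would not be available in closed form and would have to be estimated, for example from sampled generations.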