It's LIT! Reliability-Optimized LLMs with Inspectable Tools

Ruixin Zhang, Jon Donnelly, Zhicheng Guo, Ghazal Khalighinejad, Haiyang Huang, Alina Jade Barnett, Cynthia Rudin

TL;DR

LLMs are powerful but opaque, complicating troubleshooting in high-stakes tasks. The authors propose LIT, a reliability- and inspectability-driven framework that prompts LLMs to select external tools based on a per-tool cost function that combines $P$, $D$, and $C$, with the goal of minimizing the total cost $\sum_i \mathrm{cost}_i$ over the tool calls in a solution. They introduce a 1,300-question benchmark spanning the Harvard USPTO Patent Dataset and the NeurIPS 2023 Papers Dataset, plus a toolkit of eight tools: Calculator, DBLoader, PandasInterpreter, PythonInterpreter, Forecaster, TextualClassifier, LLMInferencer, and Finish. Empirical results across multiple LLMs show that LIT improves inspectability and reliability while largely preserving task performance, though some hard or distribution-shifted problems remain challenging. This work advances trustworthy LLM deployment by enabling transparent debugging and modular tool use, while also highlighting remaining challenges in token efficiency for extensive prompting.

Abstract

Large language models (LLMs) have exhibited remarkable capabilities across various domains. The ability to call external tools further expands their capability to handle real-world tasks. However, LLMs often follow an opaque reasoning process, which limits their usefulness in high-stakes domains where solutions need to be trustworthy to end users. LLMs can choose solutions that are unreliable and difficult to troubleshoot, even if better options are available. We address this issue by forcing LLMs to use external -- more reliable -- tools to solve problems when possible. We present a framework built on the tool-calling capabilities of existing LLMs to enable them to select the most reliable and easy-to-troubleshoot solution path, which may involve multiple sequential tool calls. We refer to this framework as LIT (LLMs with Inspectable Tools). In order to support LIT, we introduce a new and challenging benchmark dataset of 1,300 questions and a customizable set of reliability cost functions associated with a collection of specialized tools. These cost functions summarize how reliable each tool is and how easy it is to troubleshoot. For instance, a calculator is reliable across domains, whereas a linear prediction model is not reliable if there is distribution shift, but it is easy to troubleshoot. A tool that constructs a random forest is neither reliable nor easy to troubleshoot. These tools interact with the Harvard USPTO Patent Dataset and a new dataset of NeurIPS 2023 papers to solve mathematical, coding, and modeling problems of varying difficulty levels. We demonstrate that LLMs can achieve more reliable and informed problem-solving while maintaining task performance using our framework.
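The selection rule described above (score each candidate solution by the summed cost of its tool calls and keep the cheapest) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tool names come from the toolkit listed in the TL;DR, the zero cost of Finish is stated in the Figure 3 caption, and every other cost value, as well as the helper names `total_cost` and `select_solution`, is a hypothetical placeholder.

```python
from typing import Dict, List

# Hypothetical per-tool costs: lower means more reliable and easier to inspect.
# Only the zero cost of "Finish" is taken from the text; all other numbers are
# invented for illustration and are not the paper's values.
TOOL_COSTS: Dict[str, float] = {
    "Calculator": 1.0,
    "DBLoader": 1.0,
    "PandasInterpreter": 2.0,
    "PythonInterpreter": 3.0,
    "Forecaster": 5.0,
    "TextualClassifier": 6.0,
    "LLMInferencer": 10.0,
    "Finish": 0.0,
}


def total_cost(solution: List[str], costs: Dict[str, float] = TOOL_COSTS) -> float:
    """Sum the per-tool costs over every tool call in one candidate solution."""
    return sum(costs[tool] for tool in solution)


def select_solution(candidates: List[List[str]]) -> List[str]:
    """Return the candidate tool sequence with the lowest total cost."""
    if not candidates:
        raise ValueError("at least one candidate solution is required")
    return min(candidates, key=total_cost)


# Two hypothetical solution paths for the same question.
candidates = [
    ["DBLoader", "LLMInferencer", "Finish"],
    ["DBLoader", "PandasInterpreter", "Calculator", "Finish"],
]
best = select_solution(candidates)
print(best, total_cost(best))
```

With these placeholder costs, the longer path built from inspectable tools is preferred over the shorter path that relies on an opaque LLM inference step, which mirrors the trade-off the abstract describes.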

Paper Structure

This paper contains 24 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Conceptual summary. By using the LIT framework, the LLM selects more reliable and inspectable tools, allowing human users to debug and refine the solution proposed by the LLM, producing the correct result. In contrast, the vanilla LLM without LIT selects uninspectable solutions with reasoning that cannot be inspected or corrected.
  • Figure 2: A simplified example of the prompting framework used by LIT. In LIT, we provide the model a cost for each tool and instruct the model to provide multiple alternative solutions, selecting the one with the lowest cost when possible.
  • Figure 3: Solutions from an LLM to a question about whether a paper would be accepted to NeurIPS, with and without LIT. When using LIT, the model generates multiple candidate solutions and selects the one with the lowest cost. As a result, the model with LIT uses a logistic regression model rather than a BERT model to form its prediction, thereby using a substantially more inspectable tool. The "Finish" tool signifies the end of the logic stream and checks that the correct data type is returned. It has zero cost.
  • Figure 4: Solutions from an LLM to a question about future patent application acceptance, with and without LIT. When using LIT, the model generates multiple candidate solutions and selects the one with the best cost. As a result, the model with LIT chooses to use a simple average based on historical data rather than the ARIMA model used without LIT.
  • Figure 5: The full prompt presented to models in the LIT framework (continued in a subsequent figure). Each LLM is presented with a set of costs and instructed to form solutions that minimize these costs. A number of example solutions are provided.
  • ...and 1 more figure