It's LIT! Reliability-Optimized LLMs with Inspectable Tools
Ruixin Zhang, Jon Donnelly, Zhicheng Guo, Ghazal Khalighinejad, Haiyang Huang, Alina Jade Barnett, Cynthia Rudin
TL;DR
LLMs are powerful but opaque, which complicates troubleshooting in high-stakes tasks. The authors propose LIT (LLMs with Inspectable Tools), a reliability- and inspectability-driven framework that prompts LLMs to select external tools according to a per-tool cost function combining three terms, $P$, $D$, and $C$, so as to minimize the total cost $\sum_i \mathrm{cost}_i$ of a solution. They introduce a 1,300-question benchmark spanning the Harvard USPTO Patent Dataset and the NeurIPS 2023 Papers Dataset, plus a toolkit of eight tools: Calculator, DBLoader, PandasInterpreter, PythonInterpreter, Forecaster, TextualClassifier, LLMInferencer, and Finish. Empirical results across multiple LLMs show that LIT improves inspectability and reliability while largely preserving task performance, though some hard or distribution-shifted problems remain challenging. The work advances trustworthy LLM deployment by enabling transparent debugging and modular tool use, and it identifies the token cost of extensive prompting as a remaining challenge.
Abstract
Large language models (LLMs) have exhibited remarkable capabilities across various domains. The ability to call external tools further expands their capability to handle real-world tasks. However, LLMs often follow an opaque reasoning process, which limits their usefulness in high-stakes domains where solutions need to be trustworthy to end users. LLMs can choose solutions that are unreliable and difficult to troubleshoot, even if better options are available. We address this issue by forcing LLMs to use external, more reliable tools to solve problems when possible. We present a framework built on the tool-calling capabilities of existing LLMs that enables them to select the most reliable and easy-to-troubleshoot solution path, which may involve multiple sequential tool calls. We refer to this framework as LIT (LLMs with Inspectable Tools). To support LIT, we introduce a new and challenging benchmark dataset of 1,300 questions and a customizable set of reliability cost functions associated with a collection of specialized tools. These cost functions summarize how reliable each tool is and how easy it is to troubleshoot. For instance, a calculator is reliable across domains, whereas a linear prediction model is not reliable under distribution shift but is easy to troubleshoot. A tool that constructs a random forest is neither reliable nor easy to troubleshoot. These tools interact with the Harvard USPTO Patent Dataset and a new dataset of NeurIPS 2023 papers to solve mathematical, coding, and modeling problems of varying difficulty levels. We demonstrate that, with our framework, LLMs can achieve more reliable and informed problem-solving while maintaining task performance.
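To make the idea of choosing the lowest-cost solution path concrete, the following is a minimal sketch and not the authors' implementation. It assumes each tool has already been assigned a single scalar cost and that a path's cost is the plain sum of its tool-call costs. The tool names come from the toolkit listed above; the numeric costs, the `TOOL_COST` table, and the helper functions `path_cost` and `cheapest_path` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical per-tool costs: lower means more reliable and easier to
# troubleshoot. These numbers are illustrative placeholders, NOT values
# taken from the paper.
TOOL_COST: Dict[str, float] = {
    "Calculator": 1.0,
    "DBLoader": 2.0,
    "PandasInterpreter": 3.0,
    "PythonInterpreter": 4.0,
    "Forecaster": 5.0,
    "TextualClassifier": 6.0,
    "LLMInferencer": 8.0,
    "Finish": 0.0,
}


def path_cost(path: List[str]) -> float:
    """Total cost of a solution path: the sum of its tool-call costs."""
    return sum(TOOL_COST[tool] for tool in path)


def cheapest_path(candidates: List[List[str]]) -> List[str]:
    """Return the candidate path with the smallest total cost."""
    return min(candidates, key=path_cost)


# Two hypothetical ways of answering the same question.
candidates = [
    ["DBLoader", "PandasInterpreter", "Calculator", "Finish"],
    ["DBLoader", "LLMInferencer", "Finish"],
]
best = cheapest_path(candidates)
print(best, path_cost(best))
```

Under these assumed costs the first path is preferred even though it uses more tool calls, because each of its steps is cheaper to inspect than a single call to `LLMInferencer`. In the framework as described, the LLM is prompted to make this choice itself; the explicit enumeration here only illustrates the objective.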
