Live API-Bench: 2500+ Live APIs for Testing Multi-Step Tool Calling

Benjamin Elder, Anupama Murthi, Jungkoo Kang, Ankita Rajaram Naik, Kiran Kate, Kinjal Basu, Danish Contractor

TL;DR

Live API Bench introduces a scalable benchmark that converts NL2SQL tasks into interactive API environments to stress-test LLM tool-calling in realistic, enterprise-like settings. By constructing SLOT-BIRD, SEL-BIRD, and REST-BIRD from 11 databases and over 2,500 live tools, the authors enable deterministic ground-truth evaluation of multi-step tool use, parameter generation, and response parsing. Extensive experiments across 10 LLMs and 4 ReACT agents reveal substantial gaps in current tool-calling performance, with completion rates often below 50% even for interactive agents, highlighting the need for better planning, schema understanding, and robust parsing. The work provides open artifacts and outlines future directions, including expanding to RAG tasks and multi-turn dialogues via a Model Context Protocol, to better mirror real-world API usage.

Abstract

Large language models (LLMs) increasingly rely on external tools and APIs to execute complex tasks specified in natural language. Evaluating such tool-calling capabilities in realistic enterprise settings is challenging: APIs are often proprietary, heterogeneous, and difficult to share, limiting reproducible benchmarks. To address this, we introduce Live API Bench, a comprehensive benchmark constructed by transforming NL2SQL datasets into interactive API environments. Our pipeline converts SQL queries from BIRD SQL into executable API sequences across three formulations (SLOT, SEL, and REST) covering minimal general-purpose operations, domain-specific multi-step tasks, and function-oriented RESTful interactions, respectively. The benchmark spans 11 databases with over 2,500 invocable tools, paired with human-authored queries, ground-truth API sequences, and verified final answers. Live API Bench enables systematic evaluation of core challenges in tool use, including error handling, sequential reasoning, parameter generation, response parsing, and robustness across diverse domains. We evaluate 10 LLMs and 4 ReACT agents, observing low task completion rates (7% to 47%), which improve modestly to 50% under interactive agent settings, highlighting substantial scope for improving LLM tool-calling performance. We release all code and data associated with this paper.
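
To make the idea of turning a SQL query into an executable API sequence concrete, the following is a minimal sketch, not the benchmark's actual pipeline or tool set. The toy table, the three generic tools (`filter_rows`, `select_column`, `aggregate`), the call format, and the `completion_rate` helper are all hypothetical names chosen for illustration. It shows a query being answered by chaining tool calls, with the final answer checked against a known ground truth.

```python
from typing import Any, Callable

# Toy rows standing in for one database table (hypothetical data).
SCHOOLS = [
    {"name": "A", "county": "Alameda", "enrollment": 1200},
    {"name": "B", "county": "Alameda", "enrollment": 800},
    {"name": "C", "county": "Fresno", "enrollment": 500},
]

# Hypothetical general-purpose tools; each takes the previous step's output.
def filter_rows(rows: list[dict], column: str, value: Any) -> list[dict]:
    """Keep only rows whose `column` equals `value` (a WHERE clause)."""
    return [row for row in rows if row[column] == value]

def select_column(rows: list[dict], column: str) -> list[Any]:
    """Project a single column out of the rows (a SELECT clause)."""
    return [row[column] for row in rows]

def aggregate(values: list[Any], op: str) -> Any:
    """Reduce a list of values with a named aggregate function."""
    ops: dict[str, Callable[[list[Any]], Any]] = {
        "max": max, "min": min, "sum": sum, "count": len,
    }
    return ops[op](values)

TOOLS: dict[str, Callable[..., Any]] = {
    "filter_rows": filter_rows,
    "select_column": select_column,
    "aggregate": aggregate,
}

def run_sequence(data: list[dict], calls: list[dict]) -> Any:
    """Execute tool calls in order, feeding each output to the next call."""
    result: Any = data
    for call in calls:
        result = TOOLS[call["name"]](result, **call["arguments"])
    return result

# Assumed mapping for:
#   SELECT MAX(enrollment) FROM schools WHERE county = 'Alameda'
GROUND_TRUTH_CALLS = [
    {"name": "filter_rows", "arguments": {"column": "county", "value": "Alameda"}},
    {"name": "select_column", "arguments": {"column": "enrollment"}},
    {"name": "aggregate", "arguments": {"op": "max"}},
]

def completion_rate(predicted: list[Any], gold: list[Any]) -> float:
    """Fraction of tasks whose final answer matches the ground truth."""
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)

answer = run_sequence(SCHOOLS, GROUND_TRUTH_CALLS)
```

Because every step is deterministic, a model's predicted call sequence can be executed the same way and its final answer compared with the verified one, which is the kind of ground-truth check the abstract describes.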

Paper Structure

This paper contains 43 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: A sample NL2SQL instance along with the ground-truth API sequence for SLOT-BIRD, SEL-BIRD and REST-BIRD. Function, slot and slot value descriptions not shown for ease of presentation.
  • Figure 2: Tool-calling error classification
  • Figure 3: The pre-obfuscation and post-obfuscation API specifications for an example API endpoint from REST-BIRD.
  • Figure 4: Effect of obfuscation on completion rate
  • Figure 5: Effect of number of tools (percentage of universe) available on completion rate
  • ...and 1 more figure