API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs

Kinjal Basu, Ibrahim Abdelaziz, Subhajit Chaudhury, Soham Dan, Maxwell Crouse, Asim Munawar, Sadhana Kumaravel, Vinod Muthusamy, Pavan Kapanipathi, Luis A. Lastras

TL;DR

API-BLEND addresses the need for robust tool-augmented LLM training and benchmarking by curating seven multi-intent API datasets from existing benchmarks and transforming them into API-call sequences. The authors introduce prompt-based data curation and heuristics to generate diverse, API-focused tasks, and they benchmark open-source models with both in-domain and out-of-domain evaluations, including instruction tuning with QLoRA. Key contributions include the seven derived datasets (SeqSGD, SeqMultiWoz, SeqATIS, SeqSNIPS, SeqTopV2, SeqToolQA, ToolBench), a unified evaluation framework, and comparative analyses against ToolLLaMA-2 and other baselines. The work demonstrates that fine-tuning on diverse, tool-oriented data improves API call accuracy and sequence adherence, while also revealing challenges in slot-value normalization and slot-name ambiguity that affect generalization and OOD performance. Overall, API-BLEND provides a practical, open-resource benchmark for training and evaluating tool-augmented LLMs across multiple domains and task types.

Abstract

There is a growing need for Large Language Models (LLMs) to effectively use tools and external Application Programming Interfaces (APIs) to plan and complete tasks. As such, there is tremendous interest in methods that can acquire sufficient quantities of training and test data that involve calls to tools / APIs. Two lines of research have emerged as the predominant strategies for addressing this challenge. The first has focused on synthetic data generation techniques, while the second has involved curating task-adjacent datasets that can be transformed into API / tool-based tasks. In this paper, we focus on the task of identifying, curating, and transforming existing datasets and, in turn, introduce API-BLEND, a large corpus for training and systematic testing of tool-augmented LLMs. The datasets mimic real-world scenarios involving API tasks such as API / tool detection, slot filling, and sequencing of the detected APIs. We demonstrate the utility of the API-BLEND dataset for both training and benchmarking purposes.

Paper Structure

This paper contains 24 sections, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Example of the creation process of SeqSGD. Starting from the list of APIs, we use few-shot prompting to generate a single summarized utterance.
  • Figure 2: Example of how SeqSNIPS is created. Using a natural language utterance from MixSNIPS and the flat list of slots, we convert it into a sequence of API calls, each with a dictionary of parameter names and values.
  • Figure 3: Example of the creation process of SeqTopV2. Starting with the annotated semantic parse with mixed intents and slots, we convert it into a sequence of APIs.
  • Figure 4: Example of the creation process of SeqTopV2. Starting with the annotated semantic parse with mixed intents and slots, we convert it into a sequence of APIs.
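
The conversion described in the Figure 2 caption (an utterance plus a flat list of slots turned into a sequence of API calls, each with a dictionary of parameter names and values) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `ApiCall` class, the `to_api_sequence` and `render` helpers, the explicit intent index on each slot, and the output string format are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ApiCall:
    """One API call: a name plus a dictionary of parameter names and values."""
    name: str
    parameters: Dict[str, str] = field(default_factory=dict)


def to_api_sequence(
    intents: List[str],
    flat_slots: List[Tuple[int, str, str]],
) -> List[ApiCall]:
    """Group a flat slot list under the intents (APIs) the slots belong to.

    intents: API names in the order they occur in the utterance.
    flat_slots: (intent_index, slot_name, slot_value) triples. Tagging each
        slot with an intent index is an assumption of this sketch; the
        source data may associate slots with intents differently.
    """
    calls = [ApiCall(name) for name in intents]
    for intent_index, slot_name, slot_value in flat_slots:
        calls[intent_index].parameters[slot_name] = slot_value
    return calls


def render(calls: List[ApiCall]) -> str:
    """Serialize the sequence as one string (hypothetical target format)."""
    return " ".join(
        "{}({})".format(
            call.name,
            ", ".join(f"{k}={v!r}" for k, v in call.parameters.items()),
        )
        for call in calls
    )


if __name__ == "__main__":
    # Hypothetical two-intent utterance: "play Queen and check the weather in Boston"
    calls = to_api_sequence(
        ["PlayMusic", "GetWeather"],
        [(0, "artist", "Queen"), (1, "city", "Boston")],
    )
    print(render(calls))
```

Keeping the calls in a list preserves the order in which the APIs must be invoked, which is what allows API detection, slot filling, and sequencing to be checked against the same structure.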