API Pack: A Massive Multi-Programming Language Dataset for API Call Generation

Zhen Guo, Adriana Meza Soria, Wei Sun, Yikang Shen, Rameswar Panda

TL;DR

API Pack delivers the largest open-source instruction dataset for API call generation across 10 programming languages, enabling targeted fine-tuning of LLMs to generate API calls from natural language. The study shows that fine-tuning CodeLlama-13B on a modest Python subset can outperform leading proprietary models on unseen APIs, and that scaling to one million instances further improves generalization, including cross-language transfer with minimal multi-language data. Retrieval-augmented prompting and multi-source data contribute to robust performance, with a mixture-model approach proving competitive across languages. While promising, the work notes limitations in evaluation realism and licensing, and points to future work on broader API scenarios, privacy-preserving benchmarks, and expanded language coverage.

Abstract

We introduce API Pack, a massive multi-programming language dataset containing over one million instruction-API calls for improving the API call generation capabilities of large language models. Our evaluation highlights three key findings: First, fine-tuning on API Pack enables open-source models to outperform GPT-3.5 and GPT-4 in generating code for entirely new API calls. We show this by fine-tuning CodeLlama-13B on 20,000 Python instances from API Pack. Second, fine-tuning on a large dataset in one language, combined with smaller datasets from others, improves API generation accuracy across multiple languages. Third, we confirm the benefits of larger datasets for API generalization, as increasing fine-tuning data to one million instances enhances generalization to new APIs. To support further research, we open-source the API Pack dataset, trained model, and code at https://github.com/zguo0525/API-Pack.

Paper Structure

This paper contains 36 sections, 1 equation, 9 figures, and 8 tables.

Figures (9)

  • Figure 1: Dataset curation pipeline.
  • Figure 2: Comparison of 0-shot and 3-shot API call performance for different models in cURL, Python, and Java. Note that the expert models are specific to each programming language.
  • Figure 3: Three-shot performance in ten languages across different models.
  • Figure 4: Scaling the instruction dataset, using the 3-shot fine-tuning template, on CodeLlama-13B with 0-shot and 3-shot retrieval evaluations. The x-axis shows the size of the fine-tuning data from API Pack on a log scale. The y-axis shows endpoint or API call accuracy.
  • Figure 5: Testing pipeline.
  • ...and 4 more figures