Harnessing LLMs for API Interactions: A Framework for Classification and Synthetic Data Generation

Chunliang Tao, Xiaojing Fan, Yahe Yang

TL;DR

This work addresses two problems: translating natural language commands into API calls, and evaluating LLMs for API management. It proposes a two-component framework consisting of an API retrieval system and a synthetic data generation pipeline. The API retrieval system maps a user query to an API call, executes the call with caching, and returns the result. The data generation pipeline produces 1300 labeled synthetic queries across six modules, which are used to benchmark multiple LLMs. GPT-4 achieves near-perfect module-level and function-level classification accuracy (MLC-Acc ≈ 0.992, FLC-Acc ≈ 0.996), while smaller models such as LLaMA-3-8B perform markedly worse, indicating a substantial model-size effect on API classification. The framework offers a practical, reproducible workflow for evaluating and selecting LLMs for API management across applications.

Abstract

As Large Language Models (LLMs) advance in natural language processing, there is growing interest in leveraging their capabilities to simplify software interactions. In this paper, we propose a novel system that integrates LLMs for both classifying natural language inputs into corresponding API calls and automating the creation of sample datasets tailored to specific API functions. By classifying natural language commands, our system allows users to invoke complex software functionalities through simple inputs, improving interaction efficiency and lowering the barrier to software utilization. Our dataset generation approach also enables the efficient and systematic evaluation of different LLMs in classifying API calls, offering a practical tool for developers or business owners to assess the suitability of LLMs for customized API management. We conduct experiments on several prominent LLMs using generated sample datasets for various API functions. The results show that GPT-4 achieves a high classification accuracy of 0.996, while LLaMA-3-8B performs much worse at 0.759. These findings highlight the potential of LLMs to transform API management and validate the effectiveness of our system in guiding model testing and selection across diverse applications.
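
To make the two ideas summarized above concrete (routing a query to a module and function with cached execution, and scoring a classifier at module and function level), here is a minimal Python sketch. Everything in it is an assumption for illustration: the registry contents, the keyword-based `classify` stand-in (the paper uses an LLM for this step), and the accuracy definition, which is taken here to be the plain fraction of queries whose predicted module or function matches the label. The paper's exact metric definitions may differ.

```python
from functools import lru_cache

# Hypothetical API registry: (module, function) -> callable.
# These names are illustrative and do not come from the paper.
API_REGISTRY = {
    ("weather", "get_forecast"): lambda arg: f"forecast for {arg}",
    ("calendar", "create_event"): lambda arg: f"created event {arg}",
}


def classify(query):
    """Stand-in for the LLM classifier: simple keyword rules, not a model call."""
    text = query.lower()
    if "weather" in text or "forecast" in text:
        return ("weather", "get_forecast")
    return ("calendar", "create_event")


@lru_cache(maxsize=128)
def execute(module, function, argument):
    """Run the selected API call; repeated identical calls are served from cache."""
    return API_REGISTRY[(module, function)](argument)


def level_accuracy(predicted, expected, index):
    """Fraction of samples whose label matches at the given position.

    index 0 compares modules, index 1 compares functions (assumed definition).
    """
    if not expected:
        return 0.0
    hits = sum(p[index] == e[index] for p, e in zip(predicted, expected))
    return hits / len(expected)


# A tiny hand-written labeled set standing in for the synthetic dataset.
queries = [
    "What is the weather forecast?",
    "Book a meeting tomorrow",
    "Will it rain? Check the forecast",
]
expected = [
    ("weather", "get_forecast"),
    ("calendar", "create_event"),
    ("weather", "get_forecast"),
]

predicted = [classify(q) for q in queries]
module_acc = level_accuracy(predicted, expected, 0)
function_acc = level_accuracy(predicted, expected, 1)
```

Swapping `classify` for a call to a real model, and the hand-written list for generated queries, would give the kind of per-model comparison the abstract reports.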

Paper Structure

This paper contains 16 sections, 2 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: API Retrieval Framework
  • Figure 2: Dataset Generation
  • Figure 3: Data Generation Rules and Dataset Samples