AppBench: Planning of Multiple APIs from Various APPs for Complex User Instruction

Hongru Wang, Rui Wang, Boyang Xue, Heming Xia, Jingtao Cao, Zeming Liu, Jeff Z. Pan, Kam-Fai Wong

TL;DR

AppBench is introduced as the first benchmark to evaluate LLMs' ability to plan and execute multiple APIs from various sources in order to complete the user's task. It considers two significant challenges in using multiple APIs: graph-structured execution order and permission constraints.

Abstract

Large Language Models (LLMs) can interact with the real world by connecting with versatile external APIs, resulting in better problem-solving and task automation capabilities. Previous research primarily focuses on APIs with limited arguments from a single source or overlooks the complex dependency relationships between different APIs. However, it is essential to utilize multiple APIs collaboratively from various sources (e.g., different Apps on the iPhone), especially for complex user instructions. In this paper, we introduce AppBench, the first benchmark to evaluate LLMs' ability to plan and execute multiple APIs from various sources in order to complete the user's task. Specifically, we consider two significant challenges in multiple APIs: 1) graph structures: some APIs can be executed independently while others need to be executed one by one, resulting in a graph-like execution order; and 2) permission constraints: which source is authorized to execute the API call. We report experimental results on 9 distinct LLMs; e.g., GPT-4o achieves only a 2.0% success rate on the most complex instructions, revealing that existing state-of-the-art LLMs still cannot perform well in this situation, even with the help of in-context learning and finetuning. Our code and data are publicly available at https://github.com/ruleGreen/AppBench.
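
The two challenges named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the AppBench code base: the APP names, API names, the `PERMISSIONS` registry, and the helper functions are all hypothetical, chosen only to show how a plan with independent and dependent API calls forms a graph-like execution order, and how a permission check ties each API to the source authorized to run it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiCall:
    app: str                 # APP (source) asked to execute the call
    api: str                 # API name
    depends_on: tuple = ()   # APIs whose returned arguments this call needs


# Hypothetical registry (an assumption): which APP may execute which API.
PERMISSIONS = {
    "Restaurants": {"FindRestaurants", "ReserveRestaurant"},
    "Weather": {"GetWeather"},
}


def is_authorized(call: ApiCall) -> bool:
    """Permission constraint: the API must belong to the APP that runs it."""
    return call.api in PERMISSIONS.get(call.app, set())


def execution_levels(plan: list) -> list:
    """Graph structure: group calls into levels.

    Calls within one level do not depend on each other and could run
    independently; the levels themselves must run one after another.
    """
    remaining = {call.api: call for call in plan}
    done = set()
    levels = []
    while remaining:
        ready = sorted(
            name for name, call in remaining.items()
            if all(dep in done for dep in call.depends_on)
        )
        if not ready:
            raise ValueError("cyclic or unresolved dependency in plan")
        levels.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return levels


# Hypothetical plan: two independent calls, then one that needs an earlier result.
plan = [
    ApiCall("Restaurants", "FindRestaurants"),
    ApiCall("Weather", "GetWeather"),
    ApiCall("Restaurants", "ReserveRestaurant", depends_on=("FindRestaurants",)),
]

levels = execution_levels(plan)
all_authorized = all(is_authorized(call) for call in plan)
```

In this sketch, `FindRestaurants` and `GetWeather` land in the same level because neither needs the other's output, while `ReserveRestaurant` is placed in a later level because it depends on `FindRestaurants`. A call such as `ApiCall("Weather", "ReserveRestaurant")` would fail the permission check, since the hypothetical `Weather` APP does not own that API.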

Paper Structure

This paper contains 41 sections, 2 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: An example of a user instruction that requires two independent APIs from different APPs, since the input arguments of the two APIs do not rely on each other. We use different icons to indicate different APPs, and we color the APIs, returned arguments, and input arguments.
  • Figure 2: A high-level overview of the process used to collect AppBench, taking advantage of existing task-oriented dialogue datasets.
  • Figure 3: Examples of the different types of samples in AppBench. We color the APP, API, returned arguments, and input arguments. We also present the structure of each example, using grey nodes to indicate the user instruction and colorful nodes to indicate APIs from different APPs. We bold any argument that is returned by a previous API call (i.e., a dependency relationship). Para. and Seq. represent the parallel and sequential size of the corresponding data sample. We emphasize that we only show the simplest example of each type for better understanding; the original dataset contains data samples with much more complex logic structures.
  • Figure 4: The relationship between GPT-4o's performance and parallel and sequential scaling. Both parallel and sequential scaling pose challenges for model performance.
  • Figure 5: The performance gap between hierarchical and flat prompting on GPT-3.5 and GPT-4o.
  • ...and 1 more figure