SEAL: Suite for Evaluating API-use of LLMs

Woojeong Kim, Ashish Jagmohan, Aditya Vempaty

TL;DR

SEAL is an end-to-end testbed for evaluating LLMs in real-world API usage. It standardizes existing benchmarks, integrates an agent system for testing API retrieval and planning, and addresses the instability of real-time APIs with a GPT-4-powered API simulator that caches responses for deterministic evaluation.

Abstract

Large language models (LLMs) have limitations in handling tasks that require real-time access to external APIs. While several benchmarks like ToolBench and APIGen have been developed to assess LLMs' API-use capabilities, they often suffer from issues such as lack of generalizability, limited multi-step reasoning coverage, and instability due to real-time API fluctuations. In this paper, we introduce SEAL, an end-to-end testbed designed to evaluate LLMs in real-world API usage. SEAL standardizes existing benchmarks, integrates an agent system for testing API retrieval and planning, and addresses the instability of real-time APIs by introducing a GPT-4-powered API simulator with caching for deterministic evaluations. Our testbed provides a comprehensive evaluation pipeline that covers API retrieval, API calls, and final responses, offering a reliable framework for structured performance comparison in diverse real-world scenarios. SEAL is publicly available, with ongoing updates for new benchmarks.
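
To make the caching idea concrete, the following is a minimal sketch of how a simulated API layer could return identical responses for identical calls. It is not SEAL's implementation: the class `CachedSimulator`, the helper `cache_key`, and the stand-in `fake_generate` (which replaces the GPT-4 call with a canned response) are all hypothetical names chosen for illustration.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(api_name: str, arguments: dict) -> str:
    """Build a stable key: the same API and arguments always hash identically."""
    # sort_keys makes the key independent of argument ordering.
    canonical = json.dumps({"api": api_name, "args": arguments}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class CachedSimulator:
    """Hypothetical wrapper: generates a response once, then replays it."""

    def __init__(self, generate: Callable[[str, dict], str]):
        self._generate = generate  # assumed stand-in for an LLM-backed simulator
        self._cache: Dict[str, str] = {}
        self.misses = 0  # number of times the generator was actually invoked

    def call(self, api_name: str, arguments: dict) -> str:
        key = cache_key(api_name, arguments)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._generate(api_name, arguments)
        return self._cache[key]


def fake_generate(api_name: str, arguments: dict) -> str:
    # Placeholder for a model call; returns a canned JSON payload.
    return json.dumps({"api": api_name, "echo": arguments, "status": "ok"})


simulator = CachedSimulator(fake_generate)
first = simulator.call("get_weather", {"city": "Paris", "unit": "C"})
second = simulator.call("get_weather", {"unit": "C", "city": "Paris"})
```

In this sketch the second call reuses the stored response even though the arguments arrive in a different order, so repeated evaluation runs see the same simulated output regardless of how the underlying generator would behave on a fresh call.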

Paper Structure

This paper contains 29 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Workflow of a single-step, single-API-use system
  • Figure 2: Comparison of embedding models' API retrieval performance across benchmarks. We fixed the total number of sampled queries and report the average performance over 10 sampling runs. For ToolBench, results are based on the test split, as the ToolBench retriever was trained on the train split.
  • Figure 3: AutoGen system architecture
  • Figure 4: SEAL execution results on two benchmarks.
  • Figure 5: Execution example of SEAL
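
The Figure 2 caption describes sampling a fixed number of queries and averaging retrieval performance over 10 sampling runs. A minimal sketch of that protocol follows. The metric (recall@k), the function names, and the toy data are assumptions for illustration; the caption does not state which retrieval metric or sample size the authors used.

```python
import random
from statistics import mean
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant APIs that appear among the top-k retrieved APIs."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def averaged_recall(
    results: Dict[str, List[str]],  # query id -> ranked list of retrieved API names
    gold: Dict[str, Set[str]],      # query id -> set of ground-truth API names
    sample_size: int,
    runs: int = 10,
    k: int = 5,
    seed: int = 0,
) -> float:
    """Average recall@k over `runs` random samples of `sample_size` queries."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    query_ids = sorted(results)
    per_run = []
    for _ in range(runs):
        sample = rng.sample(query_ids, sample_size)
        per_run.append(mean(recall_at_k(results[q], gold[q], k) for q in sample))
    return mean(per_run)


# Toy data (hypothetical): one query is answered correctly, one is not.
toy_results = {"q1": ["api_a", "api_b"], "q2": ["api_c", "api_d"]}
toy_gold = {"q1": {"api_a"}, "q2": {"api_x"}}
score = averaged_recall(toy_results, toy_gold, sample_size=2, runs=10, k=2)
```

Fixing the sample size keeps the runs comparable across benchmarks of different sizes, and averaging over several samples reduces the effect of any single draw.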