SEAL: Suite for Evaluating API-use of LLMs

Woojeong Kim; Ashish Jagmohan; Aditya Vempaty

TL;DR

SEAL is an end-to-end testbed for evaluating LLMs on real-world API use. It standardizes existing benchmarks, integrates an agent system for testing API retrieval and planning, and addresses the instability of real-time APIs with a GPT-4-powered API simulator that caches responses for deterministic evaluation.

Abstract

Large language models (LLMs) have limitations in handling tasks that require real-time access to external APIs. While several benchmarks like ToolBench and APIGen have been developed to assess LLMs' API-use capabilities, they often suffer from issues such as lack of generalizability, limited multi-step reasoning coverage, and instability due to real-time API fluctuations. In this paper, we introduce SEAL, an end-to-end testbed designed to evaluate LLMs in real-world API usage. SEAL standardizes existing benchmarks, integrates an agent system for testing API retrieval and planning, and addresses the instability of real-time APIs by introducing a GPT-4-powered API simulator with caching for deterministic evaluations. Our testbed provides a comprehensive evaluation pipeline that covers API retrieval, API calls, and final responses, offering a reliable framework for structured performance comparison in diverse real-world scenarios. SEAL is publicly available, with ongoing updates for new benchmarks.
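The caching idea behind the API simulator can be illustrated with a minimal sketch. This is not SEAL's implementation: the class `CachedAPISimulator`, the key scheme, and the stand-in backend `fake_llm_backend` (used here in place of a GPT-4 call) are all hypothetical names chosen for illustration. The sketch only shows why caching a simulated response per (API name, arguments) pair makes repeated evaluations deterministic even when the underlying generator is not.

```python
import hashlib
import itertools
import json
from typing import Any, Callable, Dict


class CachedAPISimulator:
    """Hypothetical sketch: memoize simulated API responses by request."""

    def __init__(self, backend: Callable[[str, Dict[str, Any]], str]):
        # backend stands in for an LLM-powered simulator (an assumption).
        self._backend = backend
        self._cache: Dict[str, str] = {}
        self.backend_calls = 0

    @staticmethod
    def _key(api_name: str, arguments: Dict[str, Any]) -> str:
        # sort_keys makes the key independent of argument ordering.
        payload = json.dumps({"api": api_name, "args": arguments}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, api_name: str, arguments: Dict[str, Any]) -> str:
        key = self._key(api_name, arguments)
        if key not in self._cache:
            # Only the first occurrence of a request reaches the backend.
            self.backend_calls += 1
            self._cache[key] = self._backend(api_name, arguments)
        return self._cache[key]


_counter = itertools.count()


def fake_llm_backend(api_name: str, arguments: Dict[str, Any]) -> str:
    # Deliberately non-deterministic stand-in: every call returns a new sample id.
    return json.dumps({"api": api_name, "echo": arguments, "sample": next(_counter)})


simulator = CachedAPISimulator(fake_llm_backend)
first = simulator.call("get_weather", {"city": "Paris", "unit": "C"})
second = simulator.call("get_weather", {"unit": "C", "city": "Paris"})
other = simulator.call("get_weather", {"city": "Rome", "unit": "C"})
```

In this sketch, `first` and `second` are identical because the second request is served from the cache, while `other` triggers a fresh backend call. A persistent store (for example, a file on disk) would be needed to keep results stable across separate runs; the in-memory dictionary here is a simplification.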

Paper Structure (29 sections, 5 figures, 3 tables)

Introduction
Overall Landscape
Challenges in Existing Benchmarks
Lack of Generalizability
Bias Towards Simple Queries
General Instability
Incomplete Evaluation
SEAL Construction
Benchmark Standardization & Sanitization
Agent System Construction
API Retriever
API Simulator
Evaluation Pipeline
Results & Analysis
Incorrect API Retrieval
...and 14 more sections

Figures (5)

Figure 1: Workflow of a single-step, single-API-use system
Figure 2: Comparison of embedding models' API retrieval performance across benchmarks. We fixed the total number of sampled queries and report the average performance over 10 sampling runs. For ToolBench, results are based on the test split, as the ToolBench retriever was trained on the train split.
Figure 3: AutoGen system architecture
Figure 4: SEAL execution results on two benchmarks.
Figure 5: Execution example of SEAL
