Evaluating LLMs on Sequential API Call Through Automated Test Generation

Yuheng Huang, Jiayang Song, Da Song, Zhenlan Ji, Wenhan Wang, Shuai Wang, Lei Ma

TL;DR

This work tackles the challenge of evaluating LLMs capable of sequential API calls by introducing StateGen, an automated test-generation framework, and StateEval, a 120-case benchmark across RESTful, tensor, and text-to-speech domains. StateGen combines state-machine-driven trace generation, energy-based sampling, and control-flow injection to craft executable programs, which are translated into natural language tasks by a dual-LLM agent system and validated via local oracles. Empirical results show StateGen yields richer, more diverse API-call traces than baselines, while StateEval reveals substantial gaps in current models’ ability to manage complex, interdependent API workflows, with error analysis highlighting misinterpretation of docs, instruction drift, and state management as core failure modes. The publicly released framework and benchmark aim to standardize evaluation of LLM tool usage and spur improvements in end-to-end API-driven reasoning and execution.

Abstract

By integrating tools from external APIs, Large Language Models (LLMs) have expanded their capabilities across a diverse spectrum of complex real-world tasks. However, testing, evaluation, and analysis of LLM tool use remain in their early stages. Most existing benchmarks rely on manually collected test cases, many of which cannot be automatically checked for semantic correctness and instead depend on static methods such as string matching. Additionally, these benchmarks often overlook the complex interactions between sequential API calls, which are common in real-world applications. To fill this gap, we introduce StateGen, an automated framework designed to generate diverse coding tasks involving sequential API interactions. StateGen combines state-machine-based API constraint solving and validation, energy-based sampling, and control-flow injection to generate executable programs. These programs are then translated into human-like natural language task descriptions through a collaboration of two LLM agents. Using StateGen, we construct StateEval, a benchmark of 120 verified test cases spanning three representative scenarios: Session Service, Tensor Operation, and ElevenLabs MCP. Experimental results confirm that StateGen can effectively generate challenging and realistic API-oriented tasks, highlighting areas for improvement in current LLMs that incorporate APIs. We make our framework and benchmark publicly available to support future research.
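To make the idea of state-machine-constrained trace generation more concrete, the following is a minimal sketch and not StateGen's actual implementation. The toy API (`login`, `create_item`, `delete_item`, `logout`), the state names, and the energy definition (an API call's usage count, sampled with probability proportional to `exp(-energy / temperature)`) are all assumptions made for illustration; the abstract does not specify these details.

```python
import math
import random

# Hypothetical toy state machine for a session-style API (not from the paper).
# Each API call is only allowed in certain states and leads to a next state.
TRANSITIONS = {
    "logged_out": {"login": "logged_in"},
    "logged_in": {
        "create_item": "logged_in",
        "delete_item": "logged_in",
        "logout": "logged_out",
    },
}


def sample_trace(length, seed=0, temperature=1.0):
    """Sample a valid API-call sequence, favoring calls used less often so far."""
    rng = random.Random(seed)
    state = "logged_out"
    counts = {}  # how many times each API call has appeared in this trace
    trace = []
    for _ in range(length):
        calls = sorted(TRANSITIONS[state])
        # Assumed energy: the call's usage count. Lower energy gets more weight.
        weights = [math.exp(-counts.get(c, 0) / temperature) for c in calls]
        call = rng.choices(calls, weights=weights, k=1)[0]
        counts[call] = counts.get(call, 0) + 1
        trace.append(call)
        state = TRANSITIONS[state][call]
    return trace


def is_valid(trace):
    """Check that every call in the trace is allowed in the state it is made in."""
    state = "logged_out"
    for call in trace:
        if call not in TRANSITIONS[state]:
            return False
        state = TRANSITIONS[state][call]
    return True
```

Calling `sample_trace(8, seed=1)` returns a list of eight call names that `is_valid` accepts. Because only transitions allowed in the current state are ever sampled, every generated trace is valid by construction, while the count-based weighting nudges the sampler toward less-used calls.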

Paper Structure

This paper contains 29 sections, 3 equations, 11 figures, 2 tables, 1 algorithm.

Figures (11)

  • Figure 1: Motivating example of an LLM with sequential API calls.
  • Figure 2: Workflow overview of StateGen. Trace Generation (Sec. \ref{section:approach:trace}) forms the backbone of StateGen, producing valid API sequences through state machines while ensuring diversity with energy-based sampling. Program Generation (Sec. \ref{section:approach:program}) then assembles these traces with appropriate initialization and control flow structures. Finally, Instruction Translation (Sec. \ref{section:approach:translation}) employs a multi-agent system to convert the generated programs back into natural language descriptions for evaluation.
  • Figure 3: First API
  • Figure 4: Second API
  • Figure 5: Full Example
  • ...and 6 more figures