Evaluating and Mitigating Errors in LLM-Generated Web API Integrations

Daniel Maninger, Leon Chemnitz, Amir Molzam Sharifloo, Tushar Lamba, Jannis Brugger, Mira Mezini

TL;DR

The paper tackles the challenge of generating correct web API invocation code with LLMs and shows that open-source models struggle, mainly because they hallucinate endpoints and arguments. It introduces WAPIIBench, a dataset and evaluation pipeline built around OpenAPI specifications and Axios in JavaScript, with 395 synthetic endpoint tasks across real-world APIs. After establishing baseline performance with unconstrained decoding, it proposes constrained decoding: API specifications are automatically translated into regex constraints, which enforce API compliance during generation. The results show substantial improvements in correctness (average gains of 90% and 135%, depending on the starter code provided) and the elimination of illegal endpoints and arguments, making mid-size open-source models competitive with some commercial models. The work provides a practical framework and artifacts for reliable API integration code generation and outlines future directions: broader API support and integration with additional reliability-enhancing techniques.

Abstract

API integration is a cornerstone of our digital infrastructure, enabling software systems to connect and interact. However, as shown by many studies, writing or generating correct code to invoke APIs, particularly web APIs, is challenging. Although large language models (LLMs) have become popular in software development, their effectiveness in automating the generation of web API integration code remains unexplored. In order to address this, we present WAPIIBench, a dataset and evaluation pipeline designed to assess the ability of LLMs to generate web API invocation code. Our experiments with several open-source LLMs reveal that generating API invocations poses a significant challenge, resulting in hallucinated endpoints, incorrect argument usage, and other errors. None of the evaluated open-source models was able to solve more than 40% of the tasks. Motivated by those findings, we explore the potential of constrained decoding for generating API invocations. To this end, we propose an automatic translation from API specifications to constraints. Our approach prevents violations of API usage rules and significantly increases the overall correctness of the generated code, on average by 90% and 135%, depending on the provided starter code.
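To make the idea of translating an API specification into constraints more concrete, here is a minimal Python sketch. It is not the paper's translation procedure: `TOY_SPEC`, `path_to_regex`, `spec_to_regex`, and `is_legal` are hypothetical names, and the spec is a heavily simplified stand-in for an OpenAPI document that covers only methods and path templates (no arguments, bodies, or headers).

```python
import re

# Hypothetical, heavily simplified stand-in for an OpenAPI document:
# each path template maps to the HTTP methods it supports.
TOY_SPEC = {
    "/users": ["get", "post"],
    "/users/{userId}": ["get", "delete"],
}


def path_to_regex(path: str) -> str:
    """Escape a path template; each {param} matches one non-empty path segment."""
    parts = re.split(r"(\{[^{}]+\})", path)
    return "".join(
        "[^/]+" if part.startswith("{") else re.escape(part)
        for part in parts
    )


def spec_to_regex(spec: dict) -> re.Pattern:
    """Build one regex that accepts exactly the 'method path' pairs in the spec."""
    alternatives = []
    for path, methods in spec.items():
        method_group = "|".join(sorted(methods))
        alternatives.append(f"(?:(?:{method_group}) {path_to_regex(path)})")
    return re.compile("|".join(alternatives))


ENDPOINT_RE = spec_to_regex(TOY_SPEC)


def is_legal(method: str, url_path: str) -> bool:
    """Return True if the (method, path) pair is allowed by the toy spec."""
    return ENDPOINT_RE.fullmatch(f"{method} {url_path}") is not None
```

Under these assumptions, `is_legal("get", "/users/42")` holds, while a hallucinated endpoint such as `is_legal("get", "/accounts")` or an unsupported method such as `is_legal("delete", "/users")` is rejected.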

Paper Structure

This paper contains 41 sections, 6 figures, 18 tables.

Figures (6)

  • Figure 1: Benchmark design for evaluating the capabilities of LLMs in generating web API invocation code. (1) Based on an API $a$ and its specification $s$, API invocation tasks $t$ and the corresponding correct request configurations $c$ are created. (2) For each $t$, the LLM under evaluation generates an API invocation $i$. (3) $i$ is executed in a controlled environment, yielding a request configuration $c'$. (4) $c'$ is compared to $c$ and validated against $s$ to obtain various metrics, shown in Table \ref{tab:metrics}.
  • Figure 2: Performance comparison between selected models for full completion (left) and argument completion (right) with unconstrained decoding
  • Figure 3: Constrained decoding framework for generating web API invocations. In regular autoregressive generation, a decoder takes the token sequence $s_1, \dots, s_i$ and predicts the probabilities $p_j$ for each possible next token $t_j$, based on which $s_{i+1}$ is selected. Constrained decoding augments this loop with a constraint check that sets the probability of tokens that fail the check to zero. We derive our constraints from OpenAPI specifications, and they ensure that generated API invocations comply with the specification.
  • Figure 4: Performance comparison between selected models for full completion (left) and argument completion (right) with constrained decoding
  • Figure 5: Relative increase in correct codes through constrained decoding for full (left) and argument completion (right)
  • ...and 1 more figure
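The constrained decoding loop described in the Figure 3 caption can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: `toy_model` is a hypothetical decoder that returns a fixed next-token distribution, and the constraint is a prefix check against a small hard-coded set of allowed invocations (`ALLOWED`), standing in for the regex constraints derived from an OpenAPI specification.

```python
# Hypothetical set of invocations that the (toy) specification allows.
ALLOWED = ["axios.get('/users')", "axios.post('/users')"]


def is_valid_prefix(text: str) -> bool:
    """Constraint check: can `text` still be extended to an allowed invocation?"""
    return any(candidate.startswith(text) for candidate in ALLOWED)


def toy_model(prefix: str) -> dict:
    """Hypothetical decoder: a fixed distribution that favors illegal tokens."""
    return {
        ".fetch": 0.30,          # method that does not exist in the toy spec
        "('/accounts')": 0.25,   # hallucinated endpoint
        "axios": 0.20,
        ".get": 0.12,
        "('/users')": 0.08,
        ".post": 0.05,
    }


def mask_and_renormalize(probs: dict, prefix: str) -> dict:
    """Set the probability of tokens that fail the check to zero, then renormalize."""
    masked = {
        token: (p if is_valid_prefix(prefix + token) else 0.0)
        for token, p in probs.items()
    }
    total = sum(masked.values())
    if total == 0.0:
        raise ValueError("no token satisfies the constraint")
    return {token: p / total for token, p in masked.items()}


def constrained_greedy_decode(max_steps: int = 10) -> str:
    """Greedy decoding in which only constraint-compliant tokens can be selected."""
    prefix = ""
    for _ in range(max_steps):
        if prefix in ALLOWED:  # a complete, legal invocation has been produced
            break
        probs = mask_and_renormalize(toy_model(prefix), prefix)
        prefix += max(probs, key=probs.get)
    return prefix
```

Without the mask, greedy selection on this toy distribution would start with the illegal `.fetch` token; with it, every step is restricted to tokens that keep the output a prefix of an allowed invocation.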