Evaluating and Mitigating Errors in LLM-Generated Web API Integrations
Daniel Maninger, Leon Chemnitz, Amir Molzam Sharifloo, Tushar Lamba, Jannis Brugger, Mira Mezini
TL;DR
The paper tackles the challenge of generating correct web API invocation code with LLMs, revealing that open-source models struggle due to endpoint and argument hallucinations. It introduces WAPIIBench, a dataset and evaluation pipeline built around OpenAPI specifications and Axios in JavaScript, with 395 synthetic endpoint tasks across real-world APIs. After establishing baseline performance under unconstrained generation, it proposes constrained decoding: API specifications are automatically translated into regex constraints, which enforce API compliance during generation. The results show substantial improvements in correctness (average gains of 90% and 135%, depending on the starter code provided) and the elimination of illegal endpoints and arguments, making mid-size open-source models competitive with some commercial models. The work provides a practical framework and artifacts for reliable API integration code generation and outlines future directions for broader API support and integration with additional reliability-enhancing techniques.
Abstract
API integration is a cornerstone of our digital infrastructure, enabling software systems to connect and interact. However, as shown by many studies, writing or generating correct code to invoke APIs, particularly web APIs, is challenging. Although large language models (LLMs) have become popular in software development, their effectiveness in automating the generation of web API integration code remains unexplored. To address this gap, we present WAPIIBench, a dataset and evaluation pipeline designed to assess the ability of LLMs to generate web API invocation code. Our experiments with several open-source LLMs reveal that generating API invocations poses a significant challenge, resulting in hallucinated endpoints, incorrect argument usage, and other errors. None of the evaluated open-source models was able to solve more than 40% of the tasks. Motivated by these findings, we explore the potential of constrained decoding for generating API invocations. To this end, we propose an automatic translation from API specifications to constraints. Our approach prevents violations of API usage rules and significantly increases the overall correctness of the generated code, on average by 90% and 135%, depending on the provided starter code.
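To make the idea of translating an API specification into constraints more concrete, the following is a minimal sketch, not the paper's implementation. It assumes a heavily simplified, hypothetical specification (`TOY_SPEC`, with a made-up base URL and two paths) and builds one regular expression that accepts only Axios call prefixes whose HTTP method and URL are declared in that specification. The helper names `path_to_regex`, `spec_to_regex`, and `is_legal_prefix` are illustrative. Real constrained decoding would apply such a pattern token by token during generation; here it is only used to check complete strings.

```python
import re

# Hypothetical, heavily simplified stand-in for an OpenAPI specification.
# A real specification also describes parameters, request bodies, and more.
TOY_SPEC = {
    "base_url": "https://api.example.com",
    "paths": {
        "/users": ["get", "post"],
        "/users/{id}": ["get", "delete"],
    },
}


def path_to_regex(path: str) -> str:
    """Escape a path template; each {param} matches one non-empty URL segment."""
    parts = re.split(r"(\{[^}]+\})", path)
    return "".join(
        r"[^/'\s]+" if part.startswith("{") else re.escape(part)
        for part in parts
    )


def spec_to_regex(spec: dict) -> re.Pattern:
    """Build one pattern that accepts only declared method/endpoint pairs."""
    base = re.escape(spec["base_url"])
    alternatives = []
    for path, methods in spec["paths"].items():
        method_group = "|".join(methods)
        # Matches the start of a call such as: axios.get('https://.../users'
        alternatives.append(
            rf"axios\.(?:{method_group})\('{base}{path_to_regex(path)}'"
        )
    return re.compile("(?:" + "|".join(alternatives) + ")")


PATTERN = spec_to_regex(TOY_SPEC)


def is_legal_prefix(call_prefix: str) -> bool:
    """Return True if the call prefix uses a declared method and endpoint."""
    return PATTERN.fullmatch(call_prefix) is not None


if __name__ == "__main__":
    print(is_legal_prefix("axios.get('https://api.example.com/users'"))
    print(is_legal_prefix("axios.get('https://api.example.com/accounts'"))
```

In this sketch, a hallucinated endpoint such as `/accounts` or an undeclared method such as `put` fails the match. During constrained decoding, the analogous effect is that tokens leading to such strings are never sampled in the first place.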
