Identifying and Mitigating API Misuse in Large Language Models

Terry Yue Zhuo, Junda He, Jiamou Sun, Zhenchang Xing, David Lo, John Grundy, Xiaoning Du

TL;DR

The paper investigates how large language models misuse APIs when generating code in real-world contexts, focusing on Python and Java across three representative decoders. It remaps API-misuse taxonomies to include LLM-specific patterns (intent misuse, hallucination) and conducts a large-scale manual annotation to reveal prevalent misuse types. To address these issues, the authors introduce Dr.Fix, a taxonomy-guided, multi-stage LLM-based repair framework that detects, reasons about, and repairs API misuses, achieving substantial gains in BLEU and exact-match metrics and improving refusal rates. The work provides valuable empirical insights, a public dataset, and a replication package, offering a path toward more reliable, automated API usage in code-generation systems.

Abstract

API misuse in code generated by large language models (LLMs) presents a serious and growing challenge in software development. Although LLMs demonstrate impressive code generation capabilities, their interactions with complex library APIs are often error-prone and can lead to software failures and vulnerabilities. In this paper, we conduct a large-scale study of API misuse patterns in LLM-generated code by analyzing both method selection and parameter usage across Python and Java, using three representative LLMs: StarCoder-7B, Qwen2.5-Coder-7B, and GitHub Copilot. Based on extensive manual annotation of 3,209 method-level and 3,492 parameter-level misuses, we identify and categorize four recurring misuse types by building on and refining prior API misuse taxonomies. Our evaluation of the three LLMs reveals persistent challenges in API usage, particularly hallucination and intent misalignment. To address these issues, we propose Dr.Fix, an LLM-based automatic repair approach guided by our taxonomy. Dr.Fix improves repair accuracy over baseline prompting and existing repair methods, achieving gains of up to 38.4 BLEU and 40% exact match on benchmark datasets. This work offers important insights into the current limitations of LLMs in API usage and points to directions for improving automated misuse repair in code generation systems.
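To make two of the abstract's terms concrete, the following is a minimal Python sketch, not the authors' implementation and not part of Dr.Fix. It assumes a hallucinated method can be approximated as "a predicted name that does not exist on the target object", and that exact match is the fraction of predictions identical to their references. The helper names `method_exists` and `exact_match`, and the use of a built-in `dict` as the target object, are hypothetical choices made for illustration.

```python
def method_exists(obj: object, method_name: str) -> bool:
    """Return True if `method_name` is a callable attribute of `obj`.

    Assumption: a hallucinated API method is one that is absent from the
    target object. This check cannot detect intent misuse, where the
    method exists but does not do what the surrounding code needs.
    """
    return callable(getattr(obj, method_name, None))


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference, ignoring outer whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    target = {}  # hypothetical stand-in for the object whose method is infilled
    for name in ("get", "fetch"):
        print(name, "exists" if method_exists(target, name) else "not found")
    print(exact_match(["get", "fetch"], ["get", "get"]))
```

An existence check of this kind only flags names that are entirely absent; a call such as the `abs` versus `magnitude` case listed among the figures would pass it, which is why a misuse taxonomy treats intent misuse as a separate category.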

Paper Structure

This paper contains 41 sections, 4 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: An example of LLM code completion suggesting a PyTorch API call for the last line.
  • Figure 2: Example for API Method Infilling: The model should predict 'get' as the appropriate method name.
  • Figure 3: Example for API Parameter Completion: The model should predict 'url' as the appropriate parameter.
  • Figure 4: A diff-formatted example of intent misuse: Incorrect use of abs instead of magnitude in Python.
  • Figure 5: A diff-formatted example of hallucination misuse: Non-existent setCubic instead of checkIntervalContains in Java.
  • ...and 5 more figures