On Mitigating Code LLM Hallucinations with API Documentation

Nihal Jain, Robert Kwiatkowski, Baishakhi Ray, Murali Krishna Ramanathan, Varun Kumar

TL;DR

The paper tackles API hallucinations in Code LLMs by introducing CloudAPIBench, a frequency-informed benchmark for AWS and Azure APIs, enabling precise measurement of hallucination rates across low, medium, and high-frequency APIs. It shows that Code LLMs struggle more with low-frequency APIs, and that Documentation Augmented Generation (DAG) can significantly boost low-frequency performance but may hurt high-frequency APIs if retrievers are suboptimal. To address this, the authors propose selective retrieval methods—Index Lookup and API Invocation Confidence—and a combined DAG++ approach that judiciously triggers documentation augmentation. Across multiple models, DAG++ yields robust improvements, including an 8.20 percentage point gain for GPT-4o on CloudAPIBench, demonstrating a practical path to more reliable API invocations in real-world software engineering tasks.

Abstract

In this study, we address the issue of API hallucinations in various software engineering contexts. We introduce CloudAPIBench, a new benchmark designed to measure API hallucination occurrences. CloudAPIBench also provides annotations for frequencies of API occurrences in the public domain, allowing us to study API hallucinations at various frequency levels. Our findings reveal that Code LLMs struggle with low frequency APIs: for example, GPT-4o achieves only 38.58% valid low frequency API invocations. We demonstrate that Documentation Augmented Generation (DAG) significantly improves performance for low frequency APIs (increase to 47.94% with DAG) but negatively impacts high frequency APIs when using sub-optimal retrievers (a 39.02% absolute drop). To mitigate this, we propose to intelligently trigger DAG where we check against an API index or leverage Code LLMs' confidence scores to retrieve only when needed. We demonstrate that our proposed methods enhance the balance between low and high frequency API performance, resulting in more reliable API invocations (8.20% absolute improvement on CloudAPIBench for GPT-4o).
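
The selective triggering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Draft` type, the `should_retrieve` helper, the 0.8 threshold, and the choice to combine the two signals with a logical OR are all assumptions made for illustration. The sketch only shows the decision of whether to fetch documentation after a first-pass generation.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    """First-pass model output (hypothetical container, not from the paper)."""
    api_name: str      # API the model tried to invoke
    confidence: float  # model's confidence in that invocation, in [0, 1]


def should_retrieve(draft: Draft, api_index: set, threshold: float = 0.8) -> bool:
    """Decide whether to augment the prompt with documentation.

    Assumed rule: retrieve if the drafted API is missing from the index
    (likely hallucinated), or if it exists but the model is unsure.
    The threshold value is an arbitrary placeholder.
    """
    # Index lookup: an unknown API name suggests a hallucination.
    if draft.api_name not in api_index:
        return True
    # Confidence check: a known API invoked with low confidence.
    return draft.confidence < threshold


# Usage with a toy index of known API names.
api_index = {"s3.list_objects_v2", "sqs.delete_message"}
print(should_retrieve(Draft("s3.list_bucket_items", 0.95), api_index))  # not in index
print(should_retrieve(Draft("sqs.delete_message", 0.95), api_index))    # known, confident
print(should_retrieve(Draft("sqs.delete_message", 0.30), api_index))    # known, unsure
```

Skipping retrieval for confident, known invocations is what protects high frequency APIs from irrelevant augmentations, while unknown or uncertain invocations still receive documentation.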

Paper Structure (25 sections, 15 figures, 4 tables)

This paper contains 25 sections, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Introduction. (Left) A CloudAPIBench task (yellow) and StarCoder2-15B's response (red) are displayed. The target is a recently released AWS API bedrock2023, i.e., a low frequency API. Due to limited training on such APIs, the Code LLM hallucinates a non-existent API invocation. (Right) Given a prompt from CloudAPIBench, we measure the perplexity of the target API tokens using StarCoder2-15B (lower is better). The base model handles high frequency APIs well but falters with low frequency ones. While DAG (with imperfect retrievers) improves low frequency API performance, it hurts high frequency API performance due to irrelevant augmentations. This paper's methods and analyses address this limitation of DAG.
  • Figure 2: Composition of CloudAPIBench. (a) The benchmark comprises diverse APIs from various AWS and Azure services. (b) Word cloud visualizing the services in CloudAPIBench; from AWS s3 to Azure computervision, CloudAPIBench comprises many cloud-based software engineering use-cases.
  • Figure 3: Valid API Invocation. Using the API documentation, we create an API stub to capture correct usage. A candidate invocation is valid if it successfully binds to the stub. Here, delete_message requires at least one required argument for successful binding.
  • Figure 4: DAG Overview. Starting with a CloudAPIBench task, we sample an API invocation from the Code LLM. This is used to retrieve documentation for the matching APIs. We then augment the prompt with the documentation and re-trigger the model.
  • Figure 5: API Specification Augmentation. Augmented prompt for the Oracle retriever with one retrieval. The "API Specification" (blue) contains the API name and a list of its required & optional arguments, providing an efficient summary of the documentation.
  • ...and 10 more figures
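
The stub-binding check described in the Figure 3 caption can be sketched with Python's standard `inspect` module. This is an illustrative approximation, not the benchmark's evaluation code: the stub's argument list is an assumption modeled on the SQS `delete_message` call, and `is_valid_invocation` is a hypothetical helper name.

```python
import inspect


def delete_message(*, QueueUrl, ReceiptHandle):
    """Stub derived from documentation; only the signature matters.

    The keyword-only required arguments here are an assumption for
    illustration, not taken from the paper.
    """


def is_valid_invocation(stub, *args, **kwargs) -> bool:
    """Return True if the candidate arguments bind to the stub's signature."""
    try:
        # Signature.bind raises TypeError on missing or unexpected arguments.
        inspect.signature(stub).bind(*args, **kwargs)
        return True
    except TypeError:
        return False


# A candidate with all required arguments binds successfully.
print(is_valid_invocation(delete_message, QueueUrl="q", ReceiptHandle="r"))
# A candidate missing a required argument does not.
print(is_valid_invocation(delete_message, QueueUrl="q"))
# A candidate with an argument the API does not accept does not.
print(is_valid_invocation(delete_message, QueueUrl="q", ReceiptHandle="r", Force=True))
```

Binding checks argument names and arity without executing the API, so validity can be measured without cloud credentials or network access.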