When LLMs Meet API Documentation: Can Retrieval Augmentation Aid Code Generation Just as It Helps Developers?

Jingyi Chen, Songqiang Chen, Jialun Cao, Jiasi Shen, Shing-Chi Cheung

TL;DR

This work investigates how retrieval-augmented generation (RAG) can improve code generation for less-common Python libraries by leveraging their API documentation. The authors construct a dataset of 1017 APIs across four libraries, extract three components from each API document (description, parameters, examples), and build per-library RAG databases to support code-completion tasks. They evaluate multiple retrievers (BM25, Text3, GTE) and four LLMs, apply noise mutations to simulate varying documentation quality, and report that RAG improves API usage accuracy by 83% to 220%, with example code contributing the most. The findings offer actionable guidance for documenting APIs (prioritize diverse, executable examples) and suggest retriever-model configurations that robustly support low-code software development workflows for less-common libraries.

Abstract

Retrieval-augmented generation (RAG) has increasingly shown its power in extending large language models' (LLMs') capability beyond their pre-trained knowledge. Existing works have shown that RAG can help with software development tasks such as code generation, code update, and test generation. Yet, the effectiveness of adapting LLMs to fast-evolving or less common API libraries using RAG remains unknown. To bridge this gap, we take an initial step to study this unexplored yet practical setting: when developers code with a less common library, they often refer to its API documentation; likewise, when LLMs are allowed to look up API documentation via RAG, to what extent can LLMs be advanced? To mimic such a setting, we select four less common open-source Python libraries with a total of 1017 eligible APIs. We study the factors that affect the effectiveness of using the documentation of less common API libraries as additional knowledge for retrieval and generation. Our intensive study yields interesting findings: (1) RAG helps improve LLMs' performance by 83%-220%. (2) Example code contributes the most to advancing LLMs, rather than the descriptive texts and parameter lists in the API documentation. (3) LLMs can sometimes tolerate mild noise (typos in descriptions or incorrect parameters) by referencing their pre-trained knowledge or the document context. Finally, we suggest that developers pay more attention to the quality and diversity of the code examples in the API documentation. The study sheds light on future low-code software development workflows.
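
To make the retrieval setting concrete, the sketch below ranks a tiny, made-up documentation store with a textbook BM25 variant and returns the top-k API names for a task description. It is not the authors' implementation: the `toylib` library, its three entries, the tokenizer, and the `k1`/`b` values are all assumptions chosen for illustration. Each entry carries the three components the study extracts (description, parameters, example), and `fields` lets one component be left out of the index.

```python
import math
import re
from collections import Counter

# Hypothetical documentation store; "toylib" and its entries are invented.
DOCS = {
    "toylib.add": {
        "description": "Add two arrays element-wise.",
        "parameters": "x1: first input array; x2: second input array",
        "example": "toylib.add(a, b)",
    },
    "toylib.matmul": {
        "description": "Compute the matrix product of two arrays.",
        "parameters": "x1: left matrix; x2: right matrix",
        "example": "toylib.matmul(a, b)",
    },
    "toylib.reshape": {
        "description": "Give an array a new shape without changing its data.",
        "parameters": "x: input array; shape: target shape",
        "example": "toylib.reshape(a, (2, 3))",
    },
}


def tokenize(text):
    """Lowercase and split into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def build_index(docs, fields=("description", "parameters", "example")):
    """Map each API name to the tokens of its selected doc components."""
    return {
        name: tokenize(" ".join(doc.get(field, "") for field in fields))
        for name, doc in docs.items()
    }


def bm25_rank(query, index, k1=1.5, b=0.75, top_k=5):
    """Return up to top_k API names, best BM25 score first."""
    n_docs = len(index)
    avg_len = sum(len(tokens) for tokens in index.values()) / n_docs
    # Document frequency: number of entries containing each term.
    doc_freq = Counter()
    for tokens in index.values():
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = {}
    for name, tokens in index.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


index = build_index(DOCS)
print(bm25_rank("add two arrays element-wise", index, top_k=2))
```

In a full pipeline the retrieved entries would be placed in the LLM prompt ahead of the code-completion task. Passing, for example, `fields=("description", "parameters")` to `build_index` gives a rough way to mimic documentation that lacks examples.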

Paper Structure

This paper contains 51 sections, 2 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview of Study Pipeline
  • Figure 2: Extracted API Document Content of ivy.add API
  • Figure 3: A Code Completion Task for ivy.add
  • Figure 4: Transition from Fail to Pass on w/o doc, top5 doc, corr doc on Four Less-Common Libraries
  • Figure 5: API Usage Recommendation Pass Rates of RAG on Documents Lacking Different Contents (solid and shaded bars show results under the w/ top5 doc and w/ tgt doc setups, respectively, as introduced in the RQ1 setup subsection).
  • ...and 4 more figures