Assessing the Answerability of Queries in Retrieval-Augmented Code Generation

Geonmin Kim, Jaeyeon Kim, Hancheol Park, Wooksu Shin, Tae-Ho Kim

TL;DR

This study proposes an answerability evaluation task for RaCG, which assesses whether a valid answer can be generated from a user's query and the retrieved APIs, and builds a benchmark dataset to evaluate models performing this task.

Abstract

Thanks to the unprecedented language understanding and generation capabilities of large language models (LLMs), Retrieval-augmented Code Generation (RaCG) has recently been widely adopted by software developers. While this has increased productivity, incorrect code is still frequently produced. In particular, models sometimes generate plausible yet incorrect code for user queries that cannot be answered from the query and the retrieved API descriptions. This study proposes a task for evaluating answerability, which assesses whether valid answers can be generated based on users' queries and retrieved APIs in RaCG. Additionally, we build a benchmark dataset called Retrieval-augmented Code Generability Evaluation (RaCGEval) to evaluate the performance of models performing this task. Experimental results show that this task remains very challenging, with baseline models achieving a low performance of 46.7%. Furthermore, this study discusses methods that could significantly improve performance.
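
To make the task concrete, here is a minimal Python sketch of an answerability gate placed in front of code generation. It is an illustration under stated assumptions, not the authors' implementation: the three labels follow the answerable / partially answerable / unanswerable split mentioned in the figure captions, while `Sample`, `gated_generate`, `toy_assess`, and `toy_generate` are hypothetical names, and the word-overlap heuristic is only a stand-in for a real assessor such as an LLM classifier. Rejecting partially answerable queries is likewise an assumption made for this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Answerability(Enum):
    ANSWERABLE = "answerable"
    PARTIALLY_ANSWERABLE = "partially answerable"
    UNANSWERABLE = "unanswerable"


@dataclass
class Sample:
    query: str                 # the user's request
    retrieved_apis: List[str]  # API descriptions returned by the retriever


def gated_generate(
    sample: Sample,
    assess: Callable[[Sample], Answerability],
    generate: Callable[[Sample], str],
) -> Optional[str]:
    """Generate code only if the assessor labels the sample answerable."""
    if assess(sample) is Answerability.ANSWERABLE:
        return generate(sample)
    # Assumption: anything else is rejected instead of risking plausible but wrong code.
    return None


def toy_assess(sample: Sample) -> Answerability:
    """Toy stand-in for a real assessor: label by query/API word overlap."""
    query_words = set(sample.query.lower().split())
    api_words = set(" ".join(sample.retrieved_apis).lower().split())
    overlap = len(query_words & api_words) / max(len(query_words), 1)
    if overlap >= 0.5:
        return Answerability.ANSWERABLE
    if overlap > 0:
        return Answerability.PARTIALLY_ANSWERABLE
    return Answerability.UNANSWERABLE


def toy_generate(sample: Sample) -> str:
    """Placeholder generator; a real system would call a code LLM here."""
    return f"# code for: {sample.query}"


apis = ["compress a trained model"]
in_scope = Sample(query="compress model", retrieved_apis=apis)
out_of_scope = Sample(query="build web page", retrieved_apis=apis)

accepted = gated_generate(in_scope, toy_assess, toy_generate)
rejected = gated_generate(out_of_scope, toy_assess, toy_generate)
```

In this sketch the in-scope query is passed to the generator, while the out-of-scope web page request is rejected and returns `None`. A stricter or looser assessor trades the share of queries answered against the correctness of the code that is produced.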

Paper Structure

This paper contains 22 sections, 6 figures, 8 tables.

Figures (6)

  • Figure 1: An example of an LLM generating plausible code even when the request falls outside the functionality provided by the library. Given the API documentation of NetsPresso [netspresso2024], a deep learning model optimization library, the LLM still generates a response to a query requesting web page creation with the NetsPresso API.
  • Figure 2: The process of evaluating the answerability of a user's query in RaCG and then either generating code accordingly or rejecting code generation.
  • Figure 3: Examples of the NegGen method used to build unanswerable and partially answerable samples. The API highlighted in bold is the gold API for the given query in the answerable set.
  • Figure 4: Effects of in-context learning in a domain adaptation scenario.
  • Figure 5: The trade-off between coverage and precision of generated code, explored across various answerability assessment models. We average the pass@k values over 5% bins of coverage.
  • ...and 1 more figures