VersiCode: Towards Version-controllable Code Generation

Tongtong Wu, Weigang Wu, Xingyu Wang, Kang Xu, Suyu Ma, Bo Jiang, Ping Yang, Zhenchang Xing, Yuan-Fang Li, Gholamreza Haffari

TL;DR

The paper proposes two novel tasks for version-controllable code generation, version-specific code completion (VSCC) and version-aware code migration (VACM), and an extensive evaluation on VersiCode shows that the problem remains a significant challenge even for GPT-4o and other strong frontier models.

Abstract

Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development, marked by frequent library updates. This gap significantly limits LLMs' deployment in realistic settings. In this paper, we propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM). In conjunction, we introduce VersiCode, a comprehensive Python dataset specifically designed to evaluate LLMs on these two tasks, together with a novel evaluation metric, Critical Diff Check (CDC@1), which assesses code generation against evolving API requirements. We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge, even for GPT-4o and other strong frontier models. We believe the novel tasks, dataset, and metric open up a new, important research direction that will further enhance LLMs' real-world applicability. The code and resources can be found at https://github.com/wutong8023/VersiCode.

Paper Structure (34 sections, 17 figures, 12 tables)


Figures (17)

  • Figure 1: Two motivating scenarios for version-controllable code generation: (left) Interacting with LLMs in a browser, where slight query changes lead to incorrect answers, and (right) Programming in an IDE, explicitly specifying the version of dependency libraries.
  • Figure 2: The post-processing pipeline transforms metadata into specific tasks and the running example per task: (left) Leveraging pairs of metadata that share the same functionality but different library versions to construct block-level code migration instances; (right) Utilizing each metadata sample, masking version-sensitive content to create multi-granularity code completion instances.
  • Figure 3: The EM@1 results for token-level code completion from VersiCode: (a1) Comparison with existing benchmark datasets, (a2) Performance grouped by data sources, and (b) Performance grouped by API lifecycle.
  • Figure 4: The EM@1 performance for token-level code completion, grouped by year (2015-2023), with a histogram of data distribution for each year.
  • Figure 5: The process of executable code assessment, which includes data refactoring, test case generation, and validation. Starting from code snippets collected from real code involving specific API calls for a given library version, GPT-4 is employed to refactor the code into a task function. The large language model is then prompted to generate test cases from various perspectives (see the appendix for a running example of instances and test cases). Each generated test case is verified by experts, and the correctness is ensured by running the code in a specified environment. If issues arise, they are corrected through multiple iterations with GPT-4.
  • ...and 12 more figures
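
The captions for Figures 2 to 4 refer to masking version-sensitive content to build token-level completion instances and to scoring them with EM@1. The Python sketch below illustrates that idea under stated assumptions; it is not the authors' pipeline. The `<mask>` placeholder, the `CompletionInstance` fields, and both function names are hypothetical, and EM@1 is assumed here to be the fraction of instances whose single top prediction exactly matches the masked token, since this section does not define it. The pandas snippet is an illustrative example, not an instance taken from VersiCode.

```python
from dataclasses import dataclass

MASK = "<mask>"  # hypothetical placeholder; the dataset's real mask token may differ


@dataclass
class CompletionInstance:
    """A token-level completion instance (field names are assumptions)."""
    library: str
    version: str
    masked_code: str
    answer: str


def make_token_instance(library: str, version: str, code: str, api_token: str) -> CompletionInstance:
    """Mask the first occurrence of a version-sensitive token in a snippet."""
    if api_token not in code:
        raise ValueError(f"token {api_token!r} does not occur in the snippet")
    masked = code.replace(api_token, MASK, 1)
    return CompletionInstance(library, version, masked, api_token)


def exact_match_at_1(predictions: list[str], answers: list[str]) -> float:
    """Assumed EM@1: share of top-1 predictions equal to the reference token."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    hits = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return hits / len(answers)


# Illustrative use: DataFrame.append exists in pandas 1.x but was removed in 2.0,
# so the correct completion depends on the stated library version.
instance = make_token_instance(
    "pandas", "==1.5.3", "df = df.append(row, ignore_index=True)", "append"
)
score = exact_match_at_1(["append", "concat"], ["append", "append"])
```

Keeping the library name and version on the instance means the same masked snippet can be posed under different version constraints, which is the property the completion task is meant to probe.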