LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation

Sachit Kuhar, Wasi Uddin Ahmad, Zijian Wang, Nihal Jain, Haifeng Qian, Baishakhi Ray, Murali Krishna Ramanathan, Xiaofei Ma, Anoop Deoras

TL;DR

LibEvolutionEval addresses how code LLMs handle rapidly evolving public libraries by introducing a version-specific code-completion benchmark across eight libraries and a detailed study of PyTorch and Matplotlib. It combines realistic GitHub data and controlled, documentation-driven data, and analyzes how version-aware contexts and retrieved documentation affect performance, using metrics like $F_1$ and $MRR$. The results reveal substantial performance variation with library evolution, partial mitigation through version-aware retrieval, and persistent biases that scaling alone cannot erase. The work highlights practical pathways for improving code completion systems via temporal data, version-aware retrieval, and targeted fine-tuning on versioned corpora.
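For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how $F_1$ and $MRR$ are commonly computed. It assumes a whitespace-token overlap $F_1$ between a predicted and a reference completion, and an $MRR$ over ranked retrieval results with one relevant item per query. The summary does not specify the paper's exact tokenization or matching rules, so the function names `token_f1` and `mean_reciprocal_rank` and the tokenization are illustrative assumptions.

```python
from collections import Counter
from typing import List, Sequence


def token_f1(predicted: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (whitespace tokens; an assumption)."""
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_reciprocal_rank(ranked_lists: Sequence[List[str]], relevant: Sequence[str]) -> float:
    """Average of 1/rank of the relevant item per query; a query with no hit contributes 0."""
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranked, target in zip(ranked_lists, relevant):
        for rank, item in enumerate(ranked, start=1):
            if item == target:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

Under these assumptions, an exact-match completion scores an $F_1$ of 1.0, and a retriever that places the relevant document first for one query and second for another scores an $MRR$ of 0.75.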

Abstract

Recent advancements in code completion models have primarily focused on local file contexts. However, these studies do not fully capture the complexity of real-world software development, which often requires the use of rapidly evolving public libraries. To fill the gap, we introduce LibEvolutionEval, a detailed study requiring an understanding of library evolution to perform in-line code completion accurately. LibEvolutionEval provides a version-specific code-completion task comprising eight libraries (torch, torchvision, scipy, pil, tqdm, pyyaml, matplotlib, and pandas) as they evolve over the years, along with a detailed analysis of the evolution of two popular and well-maintained public libraries: PyTorch and Matplotlib. We evaluate popular public models and find that public library evolution significantly influences model performance. We explore mitigation methods by studying how retrieved version-specific library documentation and prompting can improve a model's ability to handle these fast-evolving packages, pointing to a promising path toward better handling of fast-evolving libraries.
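The mitigation idea described above, retrieving version-specific documentation and adding it to the prompt, can be sketched roughly as follows. This is not the paper's pipeline: the `DocEntry` record, the toy `DOCS` corpus, the word-overlap ranking, and the comment-style prompt layout are all hypothetical choices made for illustration. The documentation strings are short placeholders, not quotations from the PyTorch documentation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DocEntry:
    library: str
    version: str
    api_name: str
    text: str


# Toy corpus (hypothetical placeholder text, keyed by library version).
DOCS = [
    DocEntry("torch", "1.7.0", "torch.solve", "Solve a linear system AX = B."),
    DocEntry("torch", "2.0.0", "torch.linalg.solve", "Solve a linear system AX = B."),
    DocEntry("torch", "2.0.0", "torch.linalg.inv", "Compute the inverse of a square matrix."),
]


def retrieve_docs(docs: List[DocEntry], library: str, version: str,
                  query: str, top_k: int = 1) -> List[DocEntry]:
    """Filter to the exact library version, then rank by naive word overlap with the query."""
    query_words = set(query.lower().split())
    candidates = [d for d in docs if d.library == library and d.version == version]
    ranked = sorted(
        candidates,
        key=lambda d: len(query_words & set(d.text.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(library: str, version: str,
                 retrieved: List[DocEntry], code_prefix: str) -> str:
    """Prepend a version header and the retrieved docs, as comments, to the code to complete."""
    header = f"# Library: {library}=={version}\n"
    doc_block = "".join(f"# Doc [{d.api_name}]: {d.text}\n" for d in retrieved)
    return header + doc_block + code_prefix


retrieved = retrieve_docs(DOCS, "torch", "2.0.0", "solve a linear system")
prompt = build_prompt("torch", "2.0.0", retrieved, "import torch\nx = ")
```

Filtering by version before ranking is the key design choice in this sketch: documentation for an API that does not exist in the target version never reaches the prompt, so the model is not nudged toward a mismatched completion.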

Paper Structure

This paper contains 43 sections, 13 figures, 11 tables.

Figures (13)

  • Figure 1: An example of a code completion scenario under LibEvolutionEval. The incomplete code snippet on the left requires the correct API method to solve a linear system specified by two PyTorch tensors. The code LLM produces incorrect code completions due to version mismatch. The version-specific documentation is a potential augmentation that can help the LLM perform correct, version-dependent completion.
  • Figure 2: LibEvolutionEval's preprocessing pipeline to obtain version-specific code-completion metadata, including documentation, NL instructions, and API annotation.
  • Figure 3: API classification based on completion type.
  • Figure 4: Illustration of the evolution of PyTorch and Matplotlib public libraries over time. This highlights the rapid evolution of modern public libraries.
  • Figure 5: Illustration of the code completion performance of the StarCoder2, Mistral, and GPT-4o-mini models, measured by F1 score. The performance of code LLMs varies significantly as libraries evolve.
  • ...and 8 more figures