LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation

Sachit Kuhar; Wasi Uddin Ahmad; Zijian Wang; Nihal Jain; Haifeng Qian; Baishakhi Ray; Murali Krishna Ramanathan; Xiaofei Ma; Anoop Deoras

TL;DR

LibEvolutionEval addresses how code LLMs handle rapidly evolving public libraries by introducing a version-specific code-completion benchmark across eight libraries and a detailed study of PyTorch and Matplotlib. It combines realistic GitHub data and controlled, documentation-driven data, and analyzes how version-aware contexts and retrieved documentation affect performance, using metrics like $F_1$ and $MRR$. The results reveal substantial performance variation with library evolution, partial mitigation through version-aware retrieval, and persistent biases that scaling alone cannot erase. The work highlights practical pathways for improving code completion systems via temporal data, version-aware retrieval, and targeted fine-tuning on versioned corpora.
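For readers unfamiliar with the two metrics named above, the following is a minimal sketch of their generic textbook definitions. It is not the paper's evaluation code: the function names, the token-level matching used for $F_1$, and the sample inputs are all illustrative assumptions, and the benchmark's own matching rules may differ.

```python
from collections import Counter
from typing import Hashable, Sequence, Set


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Return 1/position of the first relevant item, or 0.0 if none is found."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    relevant_sets: Sequence[Set[Hashable]],
) -> float:
    """Average the reciprocal rank over all queries (MRR)."""
    pairs = list(zip(rankings, relevant_sets))
    if not pairs:
        return 0.0
    return sum(reciprocal_rank(r, s) for r, s in pairs) / len(pairs)


def token_f1(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Harmonic mean of token precision and recall (assumed token-level F1)."""
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical retrieval results: the relevant doc is ranked 2nd, then 1st.
    print(mean_reciprocal_rank([["a", "b"], ["x", "y", "z"]], [{"b"}, {"x"}]))
    # Hypothetical completion that gets two of three API tokens right.
    print(token_f1(["torch", "nn", "Linear"], ["torch", "nn", "Conv2d"]))
```

Under these definitions, MRR rewards a retriever for placing the correct version-specific documentation near the top of its ranking, while $F_1$ gives partial credit to a completion that gets only part of an API call right.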

Abstract

Recent advancements in code completion models have primarily focused on local file contexts. However, these studies do not fully capture the complexity of real-world software development, which often requires the use of rapidly evolving public libraries. To fill this gap, we introduce LibEvolutionEval, a detailed study requiring an understanding of library evolution to perform in-line code completion accurately. LibEvolutionEval provides a version-specific code-completion task comprising eight libraries (torch, torchvision, scipy, pil, tqdm, pyyaml, matplotlib, and pandas) as they evolve over the years, along with a detailed analysis of the evolution of two popular and well-maintained public libraries: PyTorch and Matplotlib. We evaluate popular public models and find that public library evolution significantly influences model performance. We explore mitigation methods by studying how retrieved version-specific library documentation and prompting can improve a model's ability to handle these fast-evolving packages, suggesting a promising path toward better support for fast-evolving libraries.
