CodeUpdateArena: Benchmarking Knowledge Editing on API Updates

Zeyu Leo Liu, Shrey Pandit, Xi Ye, Eunsol Choi, Greg Durrett

TL;DR

CodeUpdateArena addresses the challenge of keeping code LLMs current with evolving APIs by constructing a synthetic, large-scale benchmark of API updates paired with program-synthesis (PS) tasks. The dataset comprises 670 PS problems across 54 functions in 7 Python packages, generated via GPT-4 and filtered for quality, enabling evaluation of knowledge-updating methods beyond in-context prompts. Key findings show that simply prepending update documentation is insufficient, while fine-tuning on program-synthesis examples (FT(PS)) robustly improves performance, albeit with some specificity and forgetting tradeoffs. The work provides a foundation for future methods in code knowledge updating and releases resources to spur advances in semantically aware API adaptation for code LLMs.

Abstract

Large language models (LLMs) are increasingly being used to synthesize and reason about source code. However, the static nature of these models' knowledge does not reflect the fact that the libraries and API functions they invoke are continuously evolving, with functionality being added or changed. While numerous benchmarks evaluate how LLMs can generate code, no prior work has studied how an LLM's knowledge about code API functions can be updated. To fill this gap, we present CodeUpdateArena, a benchmark for knowledge editing in the code domain. An instance in our benchmark consists of a synthetic API function update paired with a program synthesis example that uses the updated functionality; our goal is to update an LLM to be able to solve this program synthesis example without providing documentation of the update at inference time. Compared to knowledge editing for facts encoded in text, success here is more challenging: a code LLM must correctly reason about the semantics of the modified function rather than just reproduce its syntax. Our dataset is constructed by first prompting GPT-4 to generate atomic and executable function updates. Then, for each update, we generate program synthesis examples whose code solutions are prone to use the update. Our benchmark covers updates of various types to 54 functions from seven diverse Python packages, with a total of 670 program synthesis examples. Our experiments show that prepending documentation of the update to open-source code LLMs (i.e., DeepSeek, CodeLlama) does not allow them to incorporate changes for problem solving, and existing knowledge editing techniques also have substantial room for improvement. We hope our benchmark will inspire new methods for knowledge updating in code LLMs.
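
To make the shape of a benchmark instance concrete, the following is a minimal sketch of an update paired with a program synthesis example and checked by unit tests. Everything in it is an illustrative assumption, not material from the dataset: the toy `clamp` function, its hypothetical `wrap` flag, the `Instance` container, and the `passes` helper are all invented here for exposition.

```python
from dataclasses import dataclass
from typing import Callable, List


def clamp(x: float, lo: float, hi: float, wrap: bool = False) -> float:
    """Toy stand-in for an updated API function (hypothetical).

    The assumed update adds the `wrap` flag: when set, x is wrapped
    into [lo, hi) instead of being clipped to [lo, hi].
    """
    if wrap:
        return lo + (x - lo) % (hi - lo)
    return max(lo, min(x, hi))


@dataclass
class Instance:
    """Hypothetical container: one update plus one synthesis problem."""
    update_doc: str  # documentation of the API change
    problem: str     # program synthesis task that benefits from the update
    tests: List[Callable[[Callable[[float], float]], bool]]  # unit tests


def passes(instance: Instance, candidate: Callable[[float], float]) -> bool:
    """Return True only if the candidate solution passes every unit test."""
    for test in instance.tests:
        try:
            if not test(candidate):
                return False
        except Exception:
            # A crashing solution counts as a failure.
            return False
    return True


instance = Instance(
    update_doc="clamp(x, lo, hi, wrap=False): `wrap` wraps x into [lo, hi).",
    problem="Write normalize_angle(deg) mapping any angle into [0, 360).",
    tests=[
        lambda f: f(370) == 10,
        lambda f: f(-90) == 270,
        lambda f: f(45) == 45,
    ],
)


def solution_with_update(deg: float) -> float:
    # Applies the new semantics, as an updated model ideally would.
    return clamp(deg, 0, 360, wrap=True)


def solution_stale(deg: float) -> float:
    # Relies on the pre-update behavior (clipping), so 370 maps to 360.
    return clamp(deg, 0, 360)


print(passes(instance, solution_with_update))
print(passes(instance, solution_stale))
```

In this sketch the stale solution fails because it only knows the old clipping behavior, which mirrors the abstract's point that success depends on reasoning about the semantics of the modified function rather than reproducing its syntax.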

Paper Structure

This paper contains 63 sections, 1 equation, 12 figures, 11 tables.

Figures (12)

  • Figure 1: CodeUpdateArena overview. We generate synthetic API updates, and then evaluate whether an edited model can successfully apply the updated API on a targeted program synthesis instance.
  • Figure 2: Overview of CodeUpdateArena generation pipeline. We first generate a spec for an update, unit tests for an update, and then the update's implementation. To generate program synthesis examples, we take an update, generate a problem specification, tests, and then a reference solution.
  • Figure 3: Sensitivity test on learning rate. Sensitivity for specificity is model-specific and may have trade-offs with efficacy. A large enough learning rate (e.g., 1e-3) is required to outperform the prepend setting.
  • Figure 4: Example of unit test skeleton
  • Figure 5: Package breakdown of updated functions in CodeUpdateArena
  • ...and 7 more figures