EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits

Wayne Chi, Valerie Chen, Ryan Shar, Aditya Mittal, Jenny Liang, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, Ion Stoica, Graham Neubig, Ameet Talwalkar, Chris Donahue

TL;DR

EDIT-Bench introduces the first in-the-wild benchmark for instructed code edits, grounding evaluation in real user instructions, code context, highlighted regions, and cursor position collected via a VSCode extension from hundreds of developers. By compiling 540 problems across five natural languages and two programming languages and testing 40 LLMs, the study shows that the task remains challenging, with only a single model surpassing 60% pass@1, and demonstrates that detailed context significantly influences performance. The benchmark reveals category- and context-dependent variability and weak correlations with existing edit benchmarks, and it emphasizes the need for realistic, diverse data and test harnesses to drive progress in LLM-powered coding tools. Overall, EDIT-Bench provides a practical, scalable framework for evaluating and improving real-world code-editing capabilities in LLMs, informing future model training and benchmarking efforts.

Abstract

Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability, and current datasets often rely on artificial sources. We introduce EDIT-Bench, a benchmark for evaluating LLM code editing capabilities grounded in real-world usage, i.e., user instructions and code contexts collected in the wild. EDIT-Bench comprises 540 problems, multiple natural and programming languages, and a diverse set of real-world use cases, ranging from resolving errors to adding features. EDIT-Bench introduces context-dependent problems that require the model to understand code context, highlighted code, and cursor position in addition to the user instruction. We evaluate 40 diverse LLMs and observe that EDIT-Bench is a challenging set of problems where only one model scores over 60%. We find that model performance varies across different categories of user instructions. Further, we find that varying levels of contextual information greatly affect task success rate, with performance varying by up to 11%, indicating the importance of evaluating with realistic context.
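
The scores quoted for EDIT-Bench are pass@1 values. As a rough illustration of what that metric means, the sketch below uses the standard unbiased pass@k estimator, which reduces to the fraction of passing samples when k = 1. The function names and the shape of the `results` mapping are assumptions made for this example; the source does not describe the benchmark's actual harness or how many samples it draws per problem.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled edits, c: how many passed all tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Average per-problem pass@1 over a (hypothetical) results mapping.

    results maps a problem id to one bool per sampled edit (True = tests passed).
    """
    scores = [pass_at_k(len(r), sum(r), 1) for r in results.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical outcomes for three problems, two samples each.
    demo = {"p1": [True, True], "p2": [True, False], "p3": [False, False]}
    print(f"pass@1 = {benchmark_pass_at_1(demo):.2f}")
```

With a single sample per problem, this is simply the share of problems whose edited code passes its tests.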

Paper Structure

This paper contains 24 sections, 14 figures, 6 tables.

Figures (14)

  • Figure 1: EDIT-Bench tests LLMs' real-world editing capabilities. We propose EDIT-Bench, an evaluation on real user instructions and code snippets collected in-the-wild. It is the first benchmark for instructed code edits that requires models to ingest the user instruction, current code, highlighted code, and cursor position to solve problems.
  • Figure 2: We develop an open-source VSCode extension to collect real-world edits.
  • Figure 3: Distribution of libraries in EDIT-Bench for Python problems. EDIT-Bench contains 74 unique imports compared to 25 (CanItEdit), 15 (Polyglot), and 16 (EditEval) from other benchmarks. See the appendix for other languages and other benchmarks.
  • Figure 4: We evaluate 40 LLMs on EDIT-Bench. We report the pass@1 of each model; only 1 out of 40 models has a pass@1 greater than 60%. In general, closed-source models outperform open models.
  • Figure 5: Comparing top-performing open-weight and closed models. To illustrate individual LLM differences, we compare 7 models and find pass@1 varies greatly depending on the problem category. Additionally, different models perform best at different categories.
  • ...and 9 more figures
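
The Figure 1 caption lists four inputs a model must ingest: the user instruction, the current code, the highlighted code, and the cursor position. A minimal sketch of how such a problem might be represented and turned into a prompt is given below. The `EditProblem` class, its field names, the prompt layout, and the `use_highlight` / `use_cursor` switches are all hypothetical and chosen only for illustration; they are not the benchmark's actual schema or prompt format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EditProblem:
    """Hypothetical container for one instructed-edit problem."""
    instruction: str                    # what the user asked for
    code: str                           # full current file contents
    highlighted: Optional[str] = None   # code the user selected, if any
    cursor_line: Optional[int] = None   # 1-indexed cursor line, if known


def build_prompt(problem: EditProblem,
                 use_highlight: bool = True,
                 use_cursor: bool = True) -> str:
    """Assemble a prompt, optionally dropping context to mimic an ablation."""
    parts = [
        f"Instruction:\n{problem.instruction}",
        f"Current code:\n{problem.code}",
    ]
    if use_highlight and problem.highlighted is not None:
        parts.append(f"Highlighted code:\n{problem.highlighted}")
    if use_cursor and problem.cursor_line is not None:
        parts.append(f"Cursor line: {problem.cursor_line}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    problem = EditProblem(
        instruction="fix the off-by-one error",
        code="for i in range(1, n):\n    total += xs[i]",
        highlighted="range(1, n)",
        cursor_line=1,
    )
    print(build_prompt(problem))
    print("---")
    print(build_prompt(problem, use_highlight=False, use_cursor=False))
```

Toggling the two flags gives a simple way to compare a model's output with and without the extra context, in the spirit of the context comparison the abstract describes.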