Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation

Spandan Garg, Benjamin Steenhoek, Yufan Huang

TL;DR

The paper addresses the mismatch between real-world IDE-based interactions and traditional bug-fixing benchmarks, which often overestimate agent capabilities. It introduces a telemetry-driven benchmark mutation pipeline that extracts developer communication templates and converts formal benchmark problems into realistic, chat-style queries, validated across Python, C#, and TypeScript benchmarks using the OpenHands agent and multiple LLMs. Key contributions include a template-based mutation approach, mutated datasets (SWE-Bench Verified-Mutated, SWE-Bench C#-Mutated, Multi-SWE-Bench-Mutated), and empirical evidence showing substantial performance gaps (20-40% on public benchmarks; 10-16% on internal benchmarks) when agents face realistic queries, highlighting benchmark overfitting. The work provides open-source prompts and mutations to enable replication and argues for adopting realism-focused, privacy-preserving benchmarks to better gauge agent capabilities and guide robust development.

Abstract

Current benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, are predominantly derived from GitHub issues and fail to accurately reflect how developers interact with chat-based coding assistants in integrated development environments (IDEs). We posit that this mismatch leads to a systematic overestimation of agents' capabilities in real-world scenarios, especially bug fixing. We introduce a novel benchmarking framework that transforms existing formal benchmarks into realistic user queries through systematic analysis of developer interaction patterns with chat-based agents. Our methodology is flexible and can be easily extended to existing benchmarks. In this paper, we apply our testing framework to SWE-Bench Verified, the TypeScript subset of Multi-SWE-Bench, and a private benchmark, SWE-Bench C#, and transform formal GitHub issue descriptions into realistic user-style queries based on telemetry analysis of interactions with a popular chat-based agent. Our findings reveal that existing benchmarks significantly overestimate agent capabilities for some models by >50% over baseline performance for public benchmarks and ~10-16% for our internal benchmark. This work establishes a new paradigm for evaluating interactive chat-based software engineering agents through benchmark mutation techniques.
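To make the idea of template-based mutation concrete, the following is a minimal toy sketch, not the authors' pipeline. It assumes that a formal problem statement can be reduced to a short summary and slotted into a chat-style template. The template strings, the `BenchmarkTask` fields, and the `summarize` and `mutate` helpers are all hypothetical and chosen only for illustration; the paper derives its templates from telemetry analysis, and a realistic version would likely use an LLM rather than the crude truncation shown here.

```python
from dataclasses import dataclass

# Hypothetical chat-style templates. These strings are illustrative
# stand-ins, not templates taken from the paper or its telemetry.
TEMPLATES = [
    "fix this: {summary}",
    "{summary} - why is this happening?",
    "getting an error in {file}, can you fix it",
]


@dataclass
class BenchmarkTask:
    """A formal benchmark item (field names are assumptions)."""
    instance_id: str
    problem_statement: str
    file_hint: str = "the code"


def summarize(problem_statement: str, max_words: int = 12) -> str:
    """Crude stand-in for a summarizer: keep the first few words
    of the first non-empty line of the problem statement."""
    first_line = next(
        (ln.strip() for ln in problem_statement.splitlines() if ln.strip()),
        "",
    )
    return " ".join(first_line.split()[:max_words])


def mutate(task: BenchmarkTask, template_index: int) -> str:
    """Rewrite a formal task as a short chat-style query using one template."""
    template = TEMPLATES[template_index % len(TEMPLATES)]
    return template.format(
        summary=summarize(task.problem_statement),
        file=task.file_hint,
    )


if __name__ == "__main__":
    demo = BenchmarkTask(
        instance_id="demo-1",
        problem_statement="Parser crashes on empty input\n\nSteps to reproduce: ...",
        file_hint="parser.py",
    )
    for i in range(len(TEMPLATES)):
        print(mutate(demo, i))
```

The sketch only illustrates the shape of the transformation: a long, structured issue goes in, and a terse query that omits most of the reproduction detail comes out. The tests and reference patch attached to the original task would stay unchanged, so agent performance on the formal and mutated variants can be compared directly.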

Paper Structure

This paper contains 30 sections, 12 figures, 1 table.

Figures (12)

  • Figure 1: Distribution of High-Level categories in user queries to a coding agent. We can see that the top categories are Code Search, Analysis (Blue), Feature Implementation (Orange) and Bug Fixing (Green).
  • Figure 2: Categorization of user queries to a software engineering agent into 10 high-level categories. We show an example for each category of user query.
  • Figure 3: Distribution of word counts in the problem statements of different benchmarks compared to real-world user queries. The distributions show how much more concise telemetry queries tend to be compared to bug-fixing benchmarks.
  • Figure 4: Plot showing a comparison of how developers communicate in benchmarks vs real-world user queries to chat-based agents. We can see that telemetry queries contain very different kinds of information compared to GitHub issues.
  • Figure 5: A high-level overview of our benchmark mutation approach.
  • ...and 7 more figures