Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation
Spandan Garg, Benjamin Steenhoek, Yufan Huang
TL;DR
The paper addresses the mismatch between real-world IDE-based interactions and traditional bug-fixing benchmarks, which often overestimate agent capabilities. It introduces a telemetry-driven benchmark mutation pipeline that extracts developer communication templates and converts formal benchmark problems into realistic, chat-style queries, validated across Python, C#, and TypeScript benchmarks using the OpenHands agent and multiple LLMs. Key contributions include a template-based mutation approach, mutated datasets (SWE-Bench Verified-Mutated, SWE-Bench C#-Mutated, Multi-SWE-Bench-Mutated), and empirical evidence showing substantial performance gaps (20-40% on public benchmarks; 10-16% on internal benchmarks) when agents face realistic queries, highlighting benchmark overfitting. The work provides open-source prompts and mutations to enable replication and argues for adopting realism-focused, privacy-preserving benchmarks to better gauge agent capabilities and guide robust development.
Abstract
Current benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, are predominantly derived from GitHub issues and fail to accurately reflect how developers interact with chat-based coding assistants in integrated development environments (IDEs). We posit that this mismatch leads to a systematic overestimation of agents' capabilities in real-world scenarios, especially bug fixing. We introduce a novel benchmarking framework that transforms existing formal benchmarks into realistic user queries through systematic analysis of developer interaction patterns with chat-based agents. Our methodology is flexible and can be easily extended to existing benchmarks. In this paper, we apply our testing framework to SWE-Bench Verified, the TypeScript subset of Multi-SWE-Bench, and a private benchmark, SWE-Bench C#, and transform formal GitHub issue descriptions into realistic user-style queries based on telemetry analysis of interactions with a popular chat-based agent. Our findings reveal that existing benchmarks significantly overestimate agent capabilities for some models by >50% over baseline performance for public benchmarks and ~10-16% for our internal benchmark. This work establishes a new paradigm for evaluating interactive chat-based software engineering agents through benchmark mutation techniques.
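
To make the idea of benchmark mutation concrete, the following is a minimal sketch of a template-based rewrite from a formal issue description to a terse, chat-style query. It is an illustration under stated assumptions, not the pipeline used in the paper: the `BenchmarkTask` fields, the `TEMPLATES` list, and the `mutate` function are hypothetical names introduced here, and the templates are invented placeholders rather than ones derived from telemetry.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkTask:
    """Hypothetical container for one benchmark problem."""
    task_id: str
    issue_text: str      # formal, GitHub-issue-style description
    error_message: str   # assumed field: the key error or symptom line


# Invented placeholder templates. The paper extracts its communication
# templates from telemetry; these only mimic a short chat style.
TEMPLATES = [
    "fix this error: {error}",
    "why am I getting {error}? please fix",
    "{error} can you help?",
]


def mutate(task: BenchmarkTask, template_index: int) -> str:
    """Render a chat-style query that keeps only the error signal."""
    template = TEMPLATES[template_index % len(TEMPLATES)]
    return template.format(error=task.error_message)


if __name__ == "__main__":
    task = BenchmarkTask(
        task_id="demo-1",
        issue_text="Long formal description with reproduction steps.",
        error_message="TypeError: 'NoneType' object is not subscriptable",
    )
    print(mutate(task, 0))
```

The sketch drops the detailed issue text on purpose: under this assumption, the mutated query carries far less context than the original issue, which is the kind of gap the evaluation is meant to expose.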
