
SWE Context Bench: A Benchmark for Context Learning in Coding

Jared Zhu, Minhao Hu, Junde Wu

TL;DR

SWE-ContextBench targets an underexplored dimension of programming agents: experience reuse. It moves beyond isolated task correctness to measure how prior solutions and summaries can accelerate solving related tasks. Built on SWE-Bench Lite, it pairs 300 base tasks with 99 related tasks drawn from linked software development artifacts, and evaluates agents on accuracy, time, and cost under multiple retrieval settings, including oracle-guided and autonomous retrieval. The results show that compact, correctly selected summaries provide the largest gains in both solution quality and efficiency, while unfiltered or poorly chosen experience yields little or even negative benefit. Full execution traces offer only marginal benefits and may hurt efficiency if selection is unreliable. The benchmark thus offers a principled, scalable framework for studying memory-augmented agents in software engineering, and informs the choice of retrieval strategy and experience representation for cumulative learning in coding tasks.

Abstract

Large language models are increasingly used as programming agents for repository-level software engineering tasks. While recent benchmarks evaluate correctness in realistic codebases, they largely treat tasks as independent and do not assess whether agents can reuse experience across related problems. As a result, the ability of agents to accumulate, retrieve, and apply prior experience, as well as the efficiency gains from such reuse, remains difficult to measure. We introduce SWE-ContextBench, a benchmark designed to explicitly evaluate experience reuse in programming agents. Built on SWE-Bench Lite, SWE-ContextBench augments 300 base tasks with 99 related tasks derived from real dependency and reference relationships among GitHub issues and pull requests, forming task sequences with shared context. The benchmark evaluates agents along three complementary dimensions: prediction accuracy, time efficiency, and cost efficiency. Using SWE-ContextBench, we study multiple experience-reuse settings, including oracle-guided and autonomous retrieval, and compare two experience representations: full execution trajectories and compact summaries. Our results show that correctly selected summarized experience improves resolution accuracy and substantially reduces runtime and token cost, particularly on harder tasks. In contrast, unfiltered or incorrectly selected experience provides limited or negative benefits. These findings highlight the importance of experience representation and retrieval quality, and position SWE-ContextBench as a principled benchmark for studying experience reuse in programming agents.
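
To make the three evaluation dimensions concrete, the following is a minimal sketch of how per-run results might be aggregated by experience-reuse setting. It is not the benchmark's actual harness: the `RunRecord` type, its field names, the setting labels, and the `summarize` helper are all hypothetical, and the cost unit is left unspecified because the abstract only says "token cost".

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """One agent run on one related task (hypothetical schema)."""
    task_id: str
    setting: str      # hypothetical label, e.g. "no_experience" or "oracle_summary"
    resolved: bool    # whether the agent's patch resolved the task
    runtime_s: float  # wall-clock time of the run, in seconds
    cost: float       # token cost of the run; unit is an assumption


def summarize(records):
    """Group runs by setting and report accuracy, mean runtime, and mean cost."""
    by_setting = {}
    for record in records:
        by_setting.setdefault(record.setting, []).append(record)

    return {
        setting: {
            "accuracy": mean(1.0 if r.resolved else 0.0 for r in runs),
            "mean_runtime_s": mean(r.runtime_s for r in runs),
            "mean_cost": mean(r.cost for r in runs),
        }
        for setting, runs in by_setting.items()
    }
```

Comparing the resulting rows across settings (for example, a no-experience baseline against oracle-selected summaries) would expose the kind of accuracy and efficiency differences the abstract describes. Any numbers passed in for illustration are placeholders, not results from the paper.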

Paper Structure (18 sections, 1 figure, 3 tables)