
Are Benchmark Tests Strong Enough? Mutation-Guided Diagnosis and Augmentation of Regression Suites

Chenglin Li, Yisen Xu, Zehao Wang, Shin Hwei Tan, Tse-Hsun Chen

Abstract

Benchmarks driven by test suites, notably SWE-bench, have become the de facto standard for measuring the effectiveness of automated issue-resolution agents: a generated patch is accepted whenever it passes the accompanying regression tests. In practice, however, insufficiently strong test suites can admit plausible yet semantically incorrect patches, inflating reported success rates. We introduce STING, a framework for targeted test augmentation that uses semantically altered program variants as diagnostic stressors to uncover and repair weaknesses in benchmark regression suites. Variants of the ground-truth patch that still pass the existing tests reveal under-constrained behaviors; these gaps then guide the generation of focused regression tests. A generated test is retained only if it (i) passes on the ground-truth patch, (ii) fails on at least one variant that survived the original suite, and (iii) remains valid under behavior-preserving transformations designed to guard against overfitting. Applied to SWE-bench Verified, STING finds that 77% of instances contain at least one surviving variant. STING produces 1,014 validated tests spanning 211 instances and increases patch-region line and branch coverage by 10.8% and 9.5%, respectively. Re-assessing the top-10 repair agents with the strengthened suites lowers their resolved rates by 4.2%-9.0%, revealing that a substantial share of previously passing patches exploit weaknesses in the benchmark tests rather than faithfully implementing the intended fix. These results underscore that reliable benchmark evaluation depends not only on patch generation, but equally on test adequacy.
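The abstract's retention criterion can be made concrete with a toy sketch. The code below is illustrative only (all names are hypothetical, not from STING's artifact) and models criteria (i) and (ii): a generated test is kept only if it passes on the ground-truth patch and fails on at least one variant that survived the original suite. Criterion (iii), robustness under behavior-preserving transformations, is omitted for brevity.

```python
from typing import Callable, List

Program = Callable[[int], int]
Test = Callable[[Program], bool]  # True if the program passes the test

def retain(test: Test,
           ground_truth: Program,
           surviving_variants: List[Program]) -> bool:
    """Toy version of STING's test-retention check:
    (i) the test must pass on the ground-truth patch, and
    (ii) it must fail (kill) at least one surviving variant."""
    if not test(ground_truth):
        return False                                   # violates (i)
    return any(not test(v) for v in surviving_variants)  # satisfies (ii)?

# Toy example: the oracle doubles its input; a surviving mutant
# differs only at x == 0, so weak tests cannot distinguish them.
ground_truth = lambda x: 2 * x
mutant = lambda x: 2 * x if x != 0 else 1

weak_test = lambda p: p(2) == 4    # passes on both -> not retained
strong_test = lambda p: p(0) == 0  # kills the mutant -> retained

print(retain(weak_test, ground_truth, [mutant]))    # False
print(retain(strong_test, ground_truth, [mutant]))  # True
```

Note how the weak test mirrors the problem STING diagnoses: it constrains too little behavior to separate the correct patch from a semantically different variant.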

Paper Structure

This paper contains 13 sections, 5 figures, 7 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Motivating example (simplified django-11276 for easier understanding). The output of escape() is consumed by urlize(). The oracle patch updates both functions, but the plausible patch modifies only escape(). The developer test checks only the local output and passes on both patches, missing the incomplete fix. STING’s generated test instead exercises the end-to-end pipeline and reveals the missing update in urlize().
  • Figure 2: Overview of the STING framework.
  • Figure 3: Simplified prompt template used in LLM-based mutation.
  • Figure 4: Simplified prompt template used in targeted test generation.
  • Figure 5: Coverage comparison between original and augmented test suites.