Unveiling Practical Shortcomings of Patch Overfitting Detection Techniques

David Williams, Ioakim Avraam, Aldeida Aleti, Matias Martinez, Justyna Petke, Federica Sarro

Abstract

Automated Program Repair (APR) can reduce the time developers spend debugging, allowing them to focus on other aspects of software development. Automatically generated bug patches are typically validated through software testing. However, this method can lead to patch overfitting, i.e., generating patches that pass the given tests but are still incorrect. Patch correctness assessment (also known as overfitting detection) techniques have been proposed to identify patches that overfit. However, prior work often assessed the effectiveness of these techniques in isolation and on datasets that do not reflect the distribution of correct-to-overfitting patches that would be generated by APR tools in typical use; thus, we still do not know their effectiveness in practice. This work presents the first comprehensive benchmarking study of several patch overfitting detection (POD) methods in a practical scenario. To this end, we curate datasets that reflect realistic assumptions (i.e., patches produced by tools run under the same experimental conditions). Next, we use these data to benchmark six state-of-the-art POD approaches -- spanning static analysis, dynamic testing, and learning-based approaches -- against two baselines based on random sampling (one from prior work and one proposed herein). Our results are striking: simple random selection outperforms all POD tools in 71% to 96% of cases, depending on the POD tool. This suggests two main takeaways: (1) current POD tools offer limited practical benefit, highlighting the need for novel techniques; (2) any POD tool must be benchmarked on realistic data and against random sampling to prove its practical effectiveness. Accordingly, we encourage the APR community to continue improving POD techniques and to adopt our proposed methodology for practical benchmarking; we make our data and code available to facilitate such adoption.

Paper Structure

This paper contains 26 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: RQ1. Performance of 5 POD tools and RS baseline on the classical (top) and repairllama (bottom) datasets.
  • Figure 2: RQ1. UpSet plot of Correct Patches (top, 127) and Overfitting Patches (bottom, 671) that were correctly classified by each tool or by subsets of tools for the classical dataset. The bar charts on the bottom left show the recall of each tool. Each dot corresponds to a set of correctly classified patches. Edges represent intersections, and the top bar chart shows the size of each intersection (percentages are given above the bars).
  • Figure 3: RQ1. MCC Scores on Bug-Level Predictions per Tool and per Project for the classical dataset.
  • Figure 4: RQ1. Matrix of Average MCC Scores for Patches Generated by APR Tools. The numbers in brackets after the APR tool name give the class balance (#correct : #overfitting).
  • Figure 5: RQ3. Comparison of tool performance against the WPC baseline for the classical dataset. Each tool’s point estimate and 95% bootstrap confidence interval are overlaid on the WPC performance envelope, which is traced across prior probabilities $p$ of predicting overfitting, from 0.5 to 1.0. The envelope represents the best achievable performance of a naive, distribution-aware classifier. Because each tool’s performance is independent of $p$, a tool’s point is placed along the x-axis where it highlights a potential interaction with the envelope; where there is no such interaction, the point is placed in the remaining space for visual clarity.
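
The captions above refer to MCC scores and to a baseline envelope traced over a prior probability $p$ of predicting overfitting. The sketch below illustrates both ideas under stated assumptions; it is not the authors' implementation. It assumes that a naive baseline labels each patch as overfitting with probability $p$, independently of the patch, and that overfitting is the positive class. The function names `mcc` and `random_baseline` are hypothetical. The class counts (127 correct, 671 overfitting) are taken from the Figure 2 caption.

```python
from math import sqrt


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def random_baseline(n_correct, n_overfitting, p):
    """Expected metrics of a classifier that predicts 'overfitting'
    with probability p, regardless of the patch (an assumed baseline).
    The positive class is 'overfitting'."""
    tp = p * n_overfitting          # overfitting patches flagged as overfitting
    fn = (1 - p) * n_overfitting    # overfitting patches labeled correct
    fp = p * n_correct              # correct patches flagged as overfitting
    tn = (1 - p) * n_correct        # correct patches labeled correct
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1,
            "mcc": mcc(tp, fp, tn, fn)}


if __name__ == "__main__":
    # Trace the expected metrics for p from 0.5 to 1.0 in steps of 0.1.
    for step in range(5, 11):
        p = step / 10
        m = random_baseline(n_correct=127, n_overfitting=671, p=p)
        print(p, {k: round(v, 3) for k, v in m.items()})
```

Under these assumptions, the expected recall of the baseline equals $p$, its expected precision equals the share of overfitting patches in the dataset, and its expected MCC is zero, which is why metrics such as precision, recall, and F1 can look strong for a random classifier on imbalanced data.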