Bogus Bugs, Duplicates, and Revealing Comments: Data Quality Issues in NPR

Julian Aron Prenner, Romain Robbes

TL;DR

The paper examines data-quality problems in neural program repair (NPR) by auditing major datasets (Megadiff, TSSB-3M, CoCoNuT) and the Defects4J benchmark, uncovering duplicates, bogus bugs, and leakage through revealing comments. It argues that even a modest level of data impurity can affect robustness, and shows that simple data-curation strategies, such as filtering bogus changes, can substantially improve model robustness with minimal impact on repair performance; data augmentation is another promising avenue. The authors advocate data-centric approaches in NPR and propose practical remedies, including regular-expression-based filtering and possible data tracks in APR competitions, to promote higher-quality training data.

Abstract

The performance of a machine learning system is not only determined by the model but also, to a substantial degree, by the data it is trained on. With the increasing use of machine learning, issues related to data quality have become a concern also in automated program repair research. In this position paper, we report some of the data-related issues we have come across when working with several large APR datasets and benchmarks, including, for instance, duplicates or "bogus bugs". We briefly discuss the potential impact of these problems on repair performance and propose possible remedies. We believe that more data-focused approaches could improve the performance and robustness of current and future APR systems.

Paper Structure

This paper contains 9 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Bogus bug in Megadiff (debug code).
  • Figure 2: Bogus bug in TSSB-3M (docfix).
  • Figure 3: Excerpt from the buggy version of Time#12; the comment // handle years in era BC was introduced as part of the patch but appears also in the buggy code. The indentation of the comment further indicates that the fix likely requires an if-statement (which is the case).
  • Figure 4: The two patches generated by the unfiltered model (U) and the filtered model (F) for a perturbed bug. The latter overlooks the actual bug and removes the logging statement (System.out.println).
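
The bogus-bug categories named in Figures 1 and 2 (debug code, docfix) hint at what a regular-expression-based filter of the kind mentioned in the TL;DR might look like. The following is a minimal sketch only, not the authors' implementation: the patterns `DEBUG_LINE` and `COMMENT_LINE`, the helper names, and the decision to treat whitespace-only changes as bogus are all assumptions made for illustration, and the input is assumed to be a unified diff of Java code.

```python
import re

# Hypothetical patterns (not taken from the paper): Java debug output
# and lines that consist only of a comment.
DEBUG_LINE = re.compile(
    r"^\s*(System\.(out|err)\.print(ln|f)?\s*\(|\w+\.printStackTrace\s*\()"
)
COMMENT_LINE = re.compile(r"^\s*(//|/\*|\*)")


def changed_lines(diff: str) -> list[str]:
    """Return added/removed lines of a unified diff, without the +/- marker."""
    return [
        line[1:]
        for line in diff.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


def is_bogus(diff: str) -> bool:
    """Heuristic: a change is bogus if every non-blank changed line is
    debug output or a comment (assumption: whitespace-only changes too)."""
    lines = [l for l in changed_lines(diff) if l.strip()]
    if not lines:
        return True
    return all(DEBUG_LINE.match(l) or COMMENT_LINE.match(l) for l in lines)


def filter_bogus(diffs: list[str]) -> list[str]:
    """Keep only the changes that the heuristic does not flag."""
    return [d for d in diffs if not is_bogus(d)]
```

A filter like this trades recall for simplicity: a change that mixes a real fix with a debug statement is kept, since not every changed line matches. How aggressive the patterns should be is a curation decision that would have to be checked against the dataset at hand.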