Towards Reliable Evaluation of Neural Program Repair with Natural Robustness Testing

Thanh Le-Cong, Dat Nguyen, Bach Le, Toby Murray

TL;DR

This article examines the naturalness of the semantic-preserving transformations used to test the robustness of neural program repair techniques, and proposes RNC, a new naturalness metric based on large language models, as a promising direction for automating the naturalness assessment of code transformations.

Abstract

In this paper, we propose shifting the focus of robustness evaluation for Neural Program Repair (NPR) techniques toward naturally occurring data transformations. To accomplish this, we first examine the naturalness of semantic-preserving transformations through a two-stage human study. This study includes (1) interviews with senior software developers to establish concrete criteria for evaluating the naturalness of these transformations, and (2) a survey involving 10 developers to assess the naturalness of 1,178 transformations, i.e., pairs of original and transformed programs, applied to 225 real-world bugs. Our findings show that only 60% of these transformations are deemed natural, while 20% are considered unnatural, with strong agreement among annotators. Moreover, the unnaturalness of these transformations significantly impacts both their applicability to benchmarks and the conclusions drawn from robustness testing. Next, we conduct natural robustness testing on NPR techniques to assess their true effectiveness against real-world data variations. Our experimental results reveal a substantial number of prediction changes in NPR techniques, leading to significant reductions in both plausible and correct patch rates when comparing performance on the original and transformed datasets. Additionally, we observe notable differences in performance improvements between NPR techniques, suggesting potential biases in NPR evaluation introduced by limited datasets. Finally, we propose an LLM-based metric to automate the assessment of transformation naturalness, ensuring the scalability of natural robustness testing.

Paper Structure (62 sections, 4 equations, 7 figures, 19 tables)

Figures (7)

  • Figure 1: Overview of our study design
  • Figure 2: An example of the survey used in the user study
  • Figure 3: Applicability of semantic-preserving transformations on 220 bugs from the Defects4J dataset. Naming, Expression, and Statement represent the number of applicable bugs for their respective categories of code transformation, while All presents results from all categories.
  • Figure 4: Proportion of naturalness categories in prediction changes of NPR techniques. Unnatural and Natural indicate transformations judged unnatural/natural with high agreement (consensus of at least 4 of 5 annotators). Likely Unnatural and Likely Natural indicate transformations judged likely unnatural/natural on which human annotators disagreed.
  • Figure 5: Prediction changes of NPR techniques against all transformations (denoted by All Results) and natural transformations (denoted by Natural Results)
  • ...and 2 more figures

Theorems & Definitions (2)

  • Definition 1
  • Definition 2