Towards Reliable Evaluation of Neural Program Repair with Natural Robustness Testing

Thanh Le-Cong; Dat Nguyen; Bach Le; Toby Murray

TL;DR

This article examines the naturalness of the semantic-preserving transformations used to test the robustness of neural program repair techniques, and proposes RNC, a new naturalness metric based on large language models, as a promising direction for automating the naturalness assessment of code transformations.

Abstract

In this paper, we propose shifting the focus of robustness evaluation for Neural Program Repair (NPR) techniques toward naturally occurring data transformations. To accomplish this, we first examine the naturalness of semantic-preserving transformations through a two-stage human study. This study includes (1) interviews with senior software developers to establish concrete criteria for evaluating the naturalness of these transformations, and (2) a survey involving 10 developers to assess the naturalness of 1,178 transformations, i.e., pairs of original and transformed programs, applied to 225 real-world bugs. Our findings show that only 60% of these transformations are deemed natural, while 20% are considered unnatural, with strong agreement among annotators. Moreover, the unnaturalness of these transformations significantly impacts both their applicability to benchmarks and the conclusions drawn from robustness testing. Next, we conduct natural robustness testing on NPR techniques to assess their true effectiveness against real-world data variations. Our experimental results reveal a substantial number of prediction changes in NPR techniques, leading to significant reductions in both plausible and correct patch rates when comparing performance on the original and transformed datasets. Additionally, we observe notable differences in performance improvements between NPR techniques, suggesting potential biases in NPR evaluation introduced by limited datasets. Finally, we propose an LLM-based metric to automate the assessment of transformation naturalness, ensuring the scalability of natural robustness testing.

Paper Structure (62 sections, 4 equations, 7 figures, 19 tables)

Introduction
Background and Related Works
Program Repair
Neural Program Repair
Empirical Studies
Program Repair Dataset
Code Naturalness
Applications of Semantic-preserving Transformation
Assessment Criteria for Naturalness of Code Transformations
Interview Design
Participant Recruitment
Interview
Data Analysis
Findings
Manual Assessment and Empirical Study Design
...and 47 more sections

Figures (7)

Figure 1: Overview of our study design
Figure 2: An example survey from the user study
Figure 3: Applicability of semantic-preserving transformations on 220 bugs from the Defects4J dataset. Naming, Expression, and Statement represent the number of applicable bugs for their respective categories of code transformation, while All presents results from all categories.
Figure 4: Proportion of naturalness categories in prediction changes of NPR techniques. Unnatural and Natural denote transformations judged unnatural or natural with high agreement (consensus of at least 4 of 5 annotators). Likely Natural and Likely Unnatural denote transformations judged likely natural or likely unnatural on which the human annotators disagreed.
Figure 5: Prediction changes of NPR techniques against all transformations (denoted by All Results) and natural transformations (denoted by Natural Results)
...and 2 more figures

Theorems & Definitions (2)

definition 1
definition 2
