Ranking Plausible Patches by Historic Feature Frequencies

Shifat Sahariar Bhuiyan; Abhishek Tiwari; Yu Pei; Carlo A. Furia

TL;DR

PrevaRank is a lightweight, history-informed approach to ranking the plausible patches produced by any APR technique. From historic fixes, it learns frequency distributions $\mathds{F}_{c}(k)$ that record how often developers used patch kind $k$ to fix bugs of category $c$. It then estimates the category $\overline{c}$ of the bug under repair and ranks each new patch of kind $k$ according to $\mathds{F}_{\overline{c}}(k)$. After training on the fix history of 81 open-source Java projects, PrevaRank consistently improved the ranking of correct fixes among patches produced by 8 Java APR tools on 168 Defects4J bugs: for example, it placed a correct fix within the top-3 positions in 27% more cases than the original tools did, with negligible overhead. The method is tool-agnostic and scalable, so it can reduce the effort developers spend triaging patches and make APR more usable in real-world pipelines.
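To make the training step concrete, here is a minimal sketch of how such frequency distributions could be built from mined fixes. It assumes that each historic fix has already been labeled with a bug category and a patch kind, and that the distribution is a plain relative frequency per category; the function name `build_frequencies` and the category and kind labels are hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict


def build_frequencies(historic_fixes):
    """Return F[c][k]: fraction of historic fixes for category c that used kind k.

    Assumption: a plain per-category relative frequency; the paper's exact
    definition of the distribution may differ.
    """
    counts = defaultdict(Counter)
    for category, kind in historic_fixes:
        counts[category][kind] += 1
    return {
        category: {kind: n / sum(kinds.values()) for kind, n in kinds.items()}
        for category, kinds in counts.items()
    }


# Hypothetical (bug category, patch kind) pairs standing in for mined history.
history = [
    ("null-pointer", "add-null-check"),
    ("null-pointer", "add-null-check"),
    ("null-pointer", "change-condition"),
    ("off-by-one", "change-condition"),
]
F = build_frequencies(history)
```

Under these assumptions, two of the three hypothetical null-pointer fixes added a null check, so that kind receives two thirds of the mass for that category.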

Abstract

Automated program repair (APR) techniques have achieved conspicuous progress, and are now capable of producing genuinely correct fixes in scenarios that were well beyond their capabilities only a few years ago. Nevertheless, even when an APR technique can find a correct fix for a bug, it still runs the risk of ranking the fix lower than other patches that are plausible (they pass all available tests) but incorrect. This can seriously hurt the technique's practical effectiveness, as the user will have to peruse a larger number of patches before finding the correct one. This paper presents PrevaRank, a technique that ranks plausible patches produced by any APR technique according to their feature similarity with historic programmer-written fixes for similar bugs. PrevaRank implements simple heuristics, which help make it scalable and applicable to any APR tool that produces plausible patches. In our experimental evaluation, after training PrevaRank on the fix history of 81 open-source Java projects, we used it to rank patches produced by 8 Java APR tools on 168 Defects4J bugs. PrevaRank consistently improved the ranking of correct fixes: for example, it ranked a correct fix within the top-3 positions in 27% more cases than the original tools did. Other experimental results indicate that PrevaRank works robustly with a variety of APR tools and bugs, with negligible overhead.
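The ranking idea in the abstract can be sketched as follows, given a frequency table that maps each bug category to a distribution over patch kinds. This is an illustration under stated assumptions, not the authors' implementation: the category estimate used here (pick the category whose distribution assigns the most total frequency to the observed patch kinds) and the tie-breaking rule (keep the APR tool's original order) are both assumptions, and all names and numbers are hypothetical.

```python
def estimate_category(F, patch_kinds):
    # Assumption: the estimated category is the one whose distribution
    # gives the largest total frequency to the kinds of the plausible patches.
    return max(F, key=lambda c: sum(F[c].get(k, 0.0) for k in patch_kinds))


def rank_patches(F, patches):
    """Rank (patch_id, kind) pairs given in the APR tool's original order."""
    category = estimate_category(F, [kind for _, kind in patches])
    # sorted() is stable, so patches with equal frequency keep their
    # original relative order (an assumed tie-breaking rule).
    ranked = sorted(patches, key=lambda p: -F[category].get(p[1], 0.0))
    return category, ranked


# Hypothetical frequency table and plausible patches.
F = {
    "null-pointer": {"add-null-check": 0.7, "change-condition": 0.3},
    "off-by-one": {"change-condition": 0.9, "change-literal": 0.1},
}
patches = [
    ("p1", "change-condition"),
    ("p2", "add-null-check"),
    ("p3", "add-null-check"),
]
category, ranked = rank_patches(F, patches)
```

In this toy input the null-check patches dominate, so the sketch estimates the null-pointer category and moves `p2` and `p3` ahead of `p1`. Because the ranking needs only one pass over the patches and a sort, a scheme of this shape adds very little cost on top of the APR tool itself.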

Paper Structure (31 sections, 2 equations, 5 figures, 9 tables, 1 algorithm)

Introduction
How PrevaRank Works
Training.
Ranking.
Training Data
Bug Classification
Patch Classification
Historic Frequencies
Bug Category Estimation
Patch Ranking
Experimental Evaluation
Implementation and performance.
Training Phase
Projects.
History mining.
...and 16 more sections

Figures (5)

Figure 1: An overview of PrevaRank's training phase: by mining historic data in software repositories, PrevaRank builds distributions $\mathds{F}_{c}(k)$ that summarize how frequently a certain patch kind $k$ was used by developers to fix a certain bug category $c$.
Figure 2: An overview of PrevaRank's ranking phase: PrevaRank determines the kind of each plausible patch (generated by a program repair tool), using the same heuristics used in the training phase; based on the frequency statistics collected in the training phase, it also estimates the category $\overline{c}$ of the bug these patches are fixing; finally, it ranks each plausible patch of kind $k$ according to the frequency distribution $\mathds{F}_{\overline{c}}(k)$.
Figure 3: Each point represents a bug fixed by an APR tool. The point's $x$ coordinate is the rank of the (first) correct fix assigned by the APR tool; the point's $y$ coordinate is the rank assigned by PrevaRank. Thus, points below the diagonal line correspond to bugs where PrevaRank improved the APR tool's ranking of correct fixes. The dotted line is the linear regression line of the points, which highlights the data trend. Axis scales are logarithmic.
Figure 4: Percentage of bugs of each category, among those for which each APR tool can generate at least one correct fix, that are ranked in the top-1, top-3, top-5, and top-10 positions only in the APR tool's original ranking, only in PrevaRank's ranking, or in both rankings. (These groupings correspond to the three scenarios o, p, and b of Table \ref{tab:rq1-main}, respectively.)
Figure 5: Each point represents a bug fixed by an APR tool. The point's $x$ coordinate is the rank assigned by PrevaRank when it uses a randomly sampled subset of the training data; the point's $y$ coordinate is the rank assigned by PrevaRank when it uses all training data. Thus, points below the diagonal line correspond to bugs where using all training data improves the ranking over using only a subset. The dotted lines are the linear regression lines of the points in each group. Axis scales are logarithmic.
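
The comparisons described in the captions of Figures 3 and 4 can be illustrated with a small sketch. It assumes that, for each bug, we know the 1-based rank of the first correct fix in the APR tool's original ranking and in PrevaRank's ranking; the function names and the sample ranks are hypothetical and are not data from the paper.

```python
def topk_scenarios(ranks, k):
    """Group bugs by where the first correct fix lands within the top-k.

    ranks maps bug_id -> (original_rank, prevarank_rank), both 1-based.
    'o': top-k only in the original ranking, 'p': top-k only in
    PrevaRank's ranking, 'b': top-k in both. Bugs outside the top-k in
    both rankings are left out.
    """
    groups = {"o": [], "p": [], "b": []}
    for bug, (original, reranked) in ranks.items():
        in_original, in_reranked = original <= k, reranked <= k
        if in_original and in_reranked:
            groups["b"].append(bug)
        elif in_original:
            groups["o"].append(bug)
        elif in_reranked:
            groups["p"].append(bug)
    return groups


def improved_bugs(ranks):
    # Points below the diagonal in a Figure 3 style plot: the reranked
    # position of the correct fix is strictly better (smaller) than the original.
    return [bug for bug, (original, reranked) in ranks.items() if reranked < original]


# Hypothetical ranks of the first correct fix for four bugs.
ranks = {"bug-1": (5, 2), "bug-2": (1, 1), "bug-3": (2, 7), "bug-4": (12, 9)}
top3 = topk_scenarios(ranks, k=3)
improved = improved_bugs(ranks)
```

With these made-up ranks, `bug-1` enters the top-3 only after reranking, `bug-3` drops out of it, `bug-2` stays in it under both rankings, and `bug-4` improves without reaching the top-3.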