Automatically Recommend Code Updates: Are We There Yet?

Yue Liu, Chakkrit Tantithamthavorn, Yonghui Liu, Patanamon Thongtanunam, Li Li

TL;DR

This paper presents the first extensive empirical evaluation of state-of-the-art CodeLMs for automatically recommending code updates, using two real-world paired-method datasets with time-aware splits. It finds that CodeLMs achieve notable accuracy in time-ignore settings but degrade sharply in realistic time-wise scenarios and in cross-project generalization, where many generated updates are unchanged or trivial. The study also finds that generated updates are usually syntactically correct but often non-substantive, and that performance drops for larger methods and more complex updates. Overall, the work exposes a significant gap between reported benchmark results and real-world practicality, and it calls for future research to improve the robustness, efficiency, and generalizability of code-update tools.

Abstract

In recent years, large pre-trained Language Models of Code (CodeLMs) have shown promising results on various software engineering tasks. One such task is automatic code update recommendation, which transforms outdated code snippets into their approved and revised counterparts. Although many CodeLM-based approaches have been proposed, claiming high accuracy, their effectiveness and reliability on real-world code update tasks remain questionable. In this paper, we present the first extensive evaluation of state-of-the-art CodeLMs for automatically recommending code updates. We assess their performance on two diverse datasets of paired updated methods, considering factors such as temporal evolution, project specificity, method size, and update complexity. Our results reveal that while CodeLMs perform well in settings that ignore temporal information, they struggle in more realistic time-wise scenarios and generalize poorly to new projects. CodeLM performance also decreases significantly for larger methods and more complex updates. Furthermore, we observe that many CodeLM-generated "updates" are actually null, especially in time-wise settings, and meaningful edits remain challenging. Our findings highlight the significant gap between the perceived and actual effectiveness of CodeLMs for real-world code update recommendation and emphasize the need for more research on improving their practicality, robustness, and generalizability.
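
The contrast between the time-ignore and time-wise settings can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the `UpdatePair` schema, the 80/20 split ratio, the whitespace normalization, and the definitions used in `score` (a perfect prediction as an exact match with the approved version, a null update as output identical to the outdated input) are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdatePair:
    """One paired method before and after an update (hypothetical schema)."""
    old: str        # outdated method body
    new: str        # approved, revised method body
    timestamp: int  # commit time, e.g. Unix seconds


def time_ignore_split(pairs, train_ratio=0.8, seed=0):
    """Shuffle at random, so later updates may leak into the training set."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def time_wise_split(pairs, train_ratio=0.8):
    """Sort by commit time, so no training pair is newer than any test pair."""
    ordered = sorted(pairs, key=lambda p: p.timestamp)
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]


def normalize(code):
    """Collapse whitespace so formatting differences do not count as edits."""
    return " ".join(code.split())


def score(test_pairs, predictions):
    """Return (perfect prediction %, null update %), one prediction per pair.

    Assumed definitions: a prediction is perfect if it equals the approved
    version, and null if it equals the outdated input.
    """
    if not test_pairs or len(test_pairs) != len(predictions):
        raise ValueError("need one prediction per test pair")
    perfect = sum(
        normalize(pred) == normalize(pair.new)
        for pair, pred in zip(test_pairs, predictions)
    )
    null = sum(
        normalize(pred) == normalize(pair.old)
        for pair, pred in zip(test_pairs, predictions)
    )
    n = len(test_pairs)
    return 100.0 * perfect / n, 100.0 * null / n
```

Under these assumptions, a model that simply echoes its input scores 0% perfect predictions and 100% null updates on any test set where every method actually changed, which is why the two rates are worth reporting together.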

Paper Structure (39 sections, 2 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: An example of collected data triplet.
  • Figure 2: Comparison of CodeLMs’ perfect prediction rates in time-ignore and time-wise scenarios.
  • Figure 3: Impact of beam search size on CodeLMs’ performance for AndroZooUpdate-S.
  • Figure 4: Comparison of CodeLMs’ PP% for within-project and cross-project code updates in the time-ignore setting.
  • Figure 5: Syntactical correctness (in percentage) of the code generated by CodeLMs.
  • ...and 4 more figures