IRCoCo: Immediate Rewards-Guided Deep Reinforcement Learning for Code Completion

Bolun Li, Zhihong Sun, Tao Huang, Hongyu Zhang, Yao Wan, Ge Li, Zhi Jin, Chen Lyu

TL;DR

IRCoCo introduces immediate rewards-guided deep reinforcement learning for line-level code completion, addressing exposure bias and the dynamic nature of code edits. It combines supervised fine-tuning with DRL in an actor-critic framework, using lightweight quality evaluators trained on BLEU and Edit-Sim to provide per-token rewards that guide the LM during progressive code completion. The approach yields consistent improvements across Python and Java datasets and multiple pre-trained LMs, demonstrating the efficacy of immediate token-level feedback over traditional delayed-reward DRL. This work highlights the practical potential of immediate rewards to enhance context-sensitive code generation and offers a foundation for further refinement of reward shaping in code intelligence.

Abstract

Code completion aims to enhance programming productivity by predicting potential code based on the current programming context. Recently, pretrained language models (LMs) have become prominent in this field. Various approaches have been proposed to fine-tune LMs using supervised fine-tuning (SFT) techniques for code completion. However, the inherent exposure bias of these models can cause errors to accumulate early in the sequence completion, leading to even more errors in subsequent completions. To address this problem, deep reinforcement learning (DRL) is an alternative technique for fine-tuning LMs for code completion, which can improve the generalization capabilities and overall performance. Nevertheless, integrating DRL-based strategies into code completion faces two major challenges: 1) The dynamic nature of the code context requires the completion model to quickly adapt to changes, which poses difficulties for conventional DRL strategies that focus on delayed rewarding of the final code state. 2) It is difficult to evaluate the correctness of partial code, thus the reward redistribution-based strategies cannot be adapted to code completion. To tackle these challenges, we propose IRCoCo, a code completion-specific DRL-based fine-tuning framework. This framework is designed to provide immediate rewards as feedback for detecting dynamic context changes arising from continuous edits during code completion. With the aid of immediate feedback, the fine-tuned LM can gain a more precise understanding of the current context, thereby enabling effective adjustment of the LM and optimizing code completion in a more refined manner. Experimental results demonstrate that fine-tuning pretrained LMs with IRCoCo leads to significant improvements in the code completion task, outperforming both SFT-based and other DRL-based baselines.
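
The contrast between delayed and immediate rewards drawn in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes a hypothetical scoring function `toy_score` (a hand-written stand-in for a learned evaluator), a fixed reference line, and a discount factor `gamma`, none of which come from the abstract. It only shows how the reward signal available at each token differs between the two settings.

```python
from typing import Callable, List


def discounted_returns(rewards: List[float], gamma: float = 0.9) -> List[float]:
    """Compute G_t = r_t + gamma * G_{t+1} for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def delayed_rewards(tokens: List[str], score: Callable[[List[str]], float]) -> List[float]:
    """Only the last token is rewarded, based on the finished completion."""
    if not tokens:
        return []
    return [0.0] * (len(tokens) - 1) + [score(tokens)]


def immediate_rewards(tokens: List[str], score: Callable[[List[str]], float]) -> List[float]:
    """Every token is rewarded, based on the prefix that ends at that token."""
    return [score(tokens[: t + 1]) for t in range(len(tokens))]


def toy_score(prefix: List[str]) -> float:
    """Hypothetical stand-in for an evaluator: share of reference tokens matched so far."""
    reference = ["return", "a", "+", "b"]  # assumed reference line, for illustration only
    matches = sum(1 for p, r in zip(prefix, reference) if p == r)
    return matches / len(reference)


completion = ["return", "a", "-", "b"]  # the third token is wrong
delayed = delayed_rewards(completion, toy_score)
immediate = immediate_rewards(completion, toy_score)
print(delayed)
print(immediate)
print(discounted_returns(immediate))
```

In this toy setting the delayed signal is zero until the final token, whereas the immediate signal stops increasing exactly at the incorrect third token, which is the kind of per-step feedback the abstract argues for.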

Paper Structure

This paper contains 25 sections, 11 equations, 6 figures, 4 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Comparison of the developer-written code, the completion by CodeGPT trained with SFT, and the completion by CodeGPT trained with DRL using delayed rewards.
  • Figure 2: Overview of IRCoCo using the actor-critic framework. First, the actor network generates synthetic samples token by token, appending each token to the end of the incomplete code fragment. The critic then assigns rewards to these samples. Using these immediate rewards, the policy is refined through the IRCoCo framework, which jointly fine-tunes the LM with SFT and DRL.
  • Figure 3: Overview of the evaluator. Training the evaluator first requires preparing training data. In the data preparation phase, we randomly split the complete code into an incomplete code fragment and a reference code fragment. We then pass the incomplete code fragment through the LM to obtain the completed code and compute the score $s$. Finally, we pair the incomplete code fragment with the score $s$ to obtain a training example. In the training phase, the evaluator predicts a score $s'$, and the training objective is to minimize the MSE loss between $s$ and $s'$.
  • Figure 4: Comparison of the IRCoCo framework under different numbers of tokens (Py150 dataset).
  • Figure 5: Comparison of the IRCoCo framework under different numbers of tokens (Java Corpus dataset).
  • ...and 1 more figure
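
The evaluator data preparation described in the Figure 3 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: `complete_fn` is a hypothetical stand-in for the LM, the score is assumed to compare the LM's completion against the reference fragment, and Edit-Sim is assumed to be one minus the character-level Levenshtein distance normalized by the longer string, which is a common definition but is not confirmed by the captions.

```python
import random
from typing import Callable, List, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance computed with a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_sim(candidate: str, reference: str) -> float:
    """Assumed Edit-Sim: 1 - distance / length of the longer string."""
    longest = max(len(candidate), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(candidate, reference) / longest


def build_example(
    tokens: List[str],
    complete_fn: Callable[[List[str]], List[str]],  # hypothetical LM wrapper
    rng: random.Random,
) -> Tuple[List[str], float]:
    """Randomly split the code, complete the prefix, and pair the prefix with score s."""
    split = rng.randint(1, len(tokens) - 1)  # needs at least two tokens
    incomplete, reference = tokens[:split], tokens[split:]
    completed = complete_fn(incomplete)
    s = edit_sim(" ".join(completed), " ".join(reference))
    return incomplete, s


def mse(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Mean squared error between target scores s and predicted scores s'."""
    return sum((s - p) ** 2 for s, p in zip(targets, predictions)) / len(targets)


# Toy usage: a fake "LM" that always completes with the token "b".
rng = random.Random(0)
prefix, score = build_example(["x", "=", "a", "+", "b"], lambda p: ["b"], rng)
print(prefix, score)
```

A real evaluator would be a trained regression model that maps the incomplete fragment to $s'$; here `mse` only shows the quantity such a model would minimize.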