When Neural Code Completion Models Size up the Situation: Attaining Cheaper and Faster Completion through Dynamic Model Inference

Zhensu Sun, Xiaoning Du, Fu Song, Shangwen Wang, Li Li

TL;DR

The paper addresses the high cost of large neural code completion models by introducing dynamic inference through the Stop&Exit Controller (SEC). An empirical study on GPT-2 shows that many tokens can already be predicted correctly from shallow layers, while completions that continue from an incorrectly predicted token are rarely helpful. This motivates a mechanism that either exits early, skipping the remaining layers, or stops generation altogether. SEC combines lightweight intermediate LM heads with a simple action classifier and a state-copying strategy that keeps token generation coherent. It achieves an average speedup of around 11%, with larger gains when threshold tolerances are relaxed, at the cost of only a minimal decline in ROUGE-L and with stable Acceptance Rates. The approach is validated on GPT-2 and CodeGen across Java and Python datasets, suggesting practical potential to reduce inference costs and enable more scalable, human-in-the-loop code completion systems.

Abstract

Leveraging recent advancements in large language models, modern neural code completion models have demonstrated the capability to generate highly accurate code suggestions. However, their massive size poses challenges in terms of computational costs and environmental impact, hindering their widespread adoption in practical scenarios. Dynamic inference emerges as a promising solution, as it allocates minimal computation during inference while maintaining the model's performance. In this research, we explore dynamic inference within the context of code completion. Initially, we conducted an empirical investigation on GPT-2, focusing on the inference capabilities of intermediate layers for code completion. We found that 54.4% of tokens can be accurately generated using just the first layer, signifying significant potential for computational savings. Moreover, despite using all layers, the model still fails to predict 14.5% of tokens correctly, and the subsequent completions continued from them are rarely considered helpful, with only a 4.2% Acceptance Rate. These findings motivate our exploration of dynamic inference in code completion and inspire us to enhance it with a decision-making mechanism that stops the generation of incorrect code. We thus propose a novel dynamic inference method specifically tailored for code completion models. This method aims not only to produce correct predictions with largely reduced computation but also to prevent incorrect predictions proactively. Our extensive evaluation shows that it can skip an average of 1.7 of the 16 layers in the models, leading to an 11.2% speedup with only a marginal 1.1% reduction in ROUGE-L.
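
As a rough consistency check on the figures quoted in the abstract, the short sketch below relates the average number of skipped layers to an idealized speedup. It assumes, purely for illustration, that latency is proportional to the number of Transformer layers executed and that speedup means old time divided by new time, minus one. The abstract does not state how speedup is measured, and any overhead from the extra heads and the classifier is ignored here, so the function names and the cost model are assumptions rather than the paper's own accounting.

```python
# Figures quoted in the abstract.
TOTAL_LAYERS = 16
AVG_SKIPPED_LAYERS = 1.7
REPORTED_SPEEDUP = 0.112


def skipped_fraction(skipped: float, total: int) -> float:
    """Share of layer computations avoided per token, on average."""
    return skipped / total


def idealized_speedup(skipped: float, total: int) -> float:
    """Speedup under the ASSUMED cost model: time is proportional to layers run.

    Defined as old_time / new_time - 1, ignoring all controller overhead.
    """
    return total / (total - skipped) - 1.0


frac = skipped_fraction(AVG_SKIPPED_LAYERS, TOTAL_LAYERS)
ideal = idealized_speedup(AVG_SKIPPED_LAYERS, TOTAL_LAYERS)

print(f"fraction of layers skipped: {frac:.3f}")
print(f"idealized speedup:          {ideal:.3f}")
print(f"reported speedup:           {REPORTED_SPEEDUP:.3f}")
```

Under this simplified model the idealized value lands slightly above the reported 11.2%, which is the direction one would expect if the controller itself adds a small cost.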

Paper Structure (25 sections, 3 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Examples demonstrating the effects of SEC. The left and right sides of the figure show the STOP and EXIT scenarios, respectively.
  • Figure 2: Demonstration of the typical generation process of a 4-layer Transformer, where the model generates four tokens for the input sequence using four steps. [SOS] and [EOS] are special tokens respectively indicating the start of the input sequence and the end of the generation.
  • Figure 3: Demonstration of the working mechanism of SEC. The left part shows how SEC controls the inference using a classifier after Layer $i$ computes its hidden state. The right part showcases the generation process of a 4-layer SEC-enhanced LCM, where SEC exits at Rounds 4 and 6 and stops at Round 5. The layers skipped by SEC are shown in grey.
  • Figure 4: The training process required for integrating SEC into an LCM with $n$ layers. The Layers and LM Heads drawn with shadows are well-trained, and their parameters are frozen during training.
  • Figure 5: The accuracy of the action classifier of SEC.
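
The control flow described in the Figure 3 caption can be sketched for a single generation step as follows. This is a minimal illustration, not the authors' implementation: the toy layers, LM heads, classifier rule, and all names (`Action`, `StepResult`, `generate_step`) are hypothetical. Two behaviours are assumptions made here: a CONTINUE decision at the last layer is treated as EXIT, and "state copying" is modelled as reusing the exit layer's hidden state for every skipped layer.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Sequence

Hidden = List[float]


class Action(Enum):
    CONTINUE = "continue"  # run the next layer
    EXIT = "exit"          # predict from this layer's LM head, skip the rest
    STOP = "stop"          # abandon generation, no token is emitted


@dataclass
class StepResult:
    token: Optional[int]   # None when generation was stopped
    layers_run: int        # number of layers actually computed
    states: List[Hidden]   # per-layer hidden states kept for later steps


def generate_step(
    embedding: Hidden,
    layers: Sequence[Callable[[Hidden], Hidden]],
    heads: Sequence[Callable[[Hidden], int]],
    classify: Callable[[int, Hidden], Action],
) -> StepResult:
    """Run one token step, consulting the classifier after every layer."""
    if not layers or len(layers) != len(heads):
        raise ValueError("need one LM head per layer and at least one layer")
    hidden = embedding
    states: List[Hidden] = []
    last = len(layers) - 1
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        states.append(hidden)
        action = classify(i, hidden)
        if action is Action.CONTINUE and i == last:
            action = Action.EXIT  # assumption: nothing left to run, so predict
        if action is Action.STOP:
            return StepResult(None, i + 1, states)
        if action is Action.EXIT:
            # Assumed state copying: skipped layers reuse the exit state.
            states.extend([hidden] * (last - i))
            return StepResult(heads[i](hidden), i + 1, states)
    raise AssertionError("unreachable: the last layer always exits or stops")


# Toy 4-layer model: each layer adds 1, each head returns the integer sum.
toy_layers = [lambda h: [x + 1.0 for x in h]] * 4
toy_heads = [lambda h: int(sum(h))] * 4


def toy_classifier(layer_index: int, hidden: Hidden) -> Action:
    """Hypothetical rule standing in for a trained action classifier."""
    if hidden[0] < 0:
        return Action.STOP
    if hidden[0] >= 2:
        return Action.EXIT
    return Action.CONTINUE


exit_result = generate_step([0.0], toy_layers, toy_heads, toy_classifier)
stop_result = generate_step([-5.0], toy_layers, toy_heads, toy_classifier)
print(exit_result.token, exit_result.layers_run)
print(stop_result.token, stop_result.layers_run)
```

In the first call the toy classifier exits after the second layer, so two of the four layers are skipped and their state slots are filled by copying. In the second call it stops after the first layer and no token is produced.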