When to Stop? Towards Efficient Code Generation in LLMs with Excess Token Prevention

Lianghong Guo, Yanlin Wang, Ensheng Shi, Wanjun Zhong, Hongyu Zhang, Jiachi Chen, Ruikai Zhang, Yuchi Ma, Zibin Zheng

TL;DR

CodeFast identifies excess token generation as a major bottleneck in Code LLM inference and introduces GenGuard, a lightweight gating classifier trained via an automatic data-construction framework. Coupled with a line-voting mechanism, CodeFast rapidly terminates generation when extraneous content is detected, yielding substantial speedups (up to 4.52x) while preserving code quality across multiple Code LLMs and programming languages. The approach relies on a minimal, cross-language GenGuard trained with automatically labeled data, and demonstrates strong stability and generalization to untrained datasets, including class-level code generation. Overall, CodeFast offers a practical, extensible path to deploy faster Code LLMs in real-world development environments and IDEs, with code and data publicly available.

Abstract

Code generation aims to automatically generate code snippets that meet given natural language requirements and plays an important role in software development. Although Code LLMs have shown excellent performance in this domain, their long generation time poses a significant limitation in practical use. In this paper, we first conduct an in-depth preliminary study with different Code LLMs on code generation tasks and identify a significant efficiency issue, i.e., the continual generation of excess tokens. This harms developer productivity and leads to huge computational waste. To address it, we introduce CodeFast, an inference acceleration approach for Code LLMs on code generation. The key idea of CodeFast is to terminate the inference process in time when unnecessary excess tokens are detected. First, we propose an automatic data construction framework to obtain training data. Then, we train a unified lightweight model, GenGuard, applicable to multiple programming languages, to predict whether to terminate inference at the current step. Finally, we enhance Code LLMs with GenGuard to accelerate their inference in code generation tasks. We conduct extensive experiments with CodeFast on five representative Code LLMs across four widely used code generation datasets. Experimental results show that (1) CodeFast can significantly improve the inference speed of various Code LLMs in code generation, ranging from 34% to 452%, without compromising the quality of generated code, and (2) CodeFast is stable across different parameter settings and can generalize to untrained datasets. Our code and data are available at https://github.com/DeepSoftwareAnalytics/CodeFast
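
The early-termination idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `generate_with_early_stop`, `toy_is_excess`, and `vote_threshold` are hypothetical, a hand-written heuristic stands in for the trained GenGuard model, and the voting rule (stop when more than a given fraction of a finished line's per-token predictions say "excess") is only one plausible reading of line voting.

```python
from typing import Callable, Iterable, List


def generate_with_early_stop(
    tokens: Iterable[str],
    is_excess: Callable[[str, str], bool],
    vote_threshold: float = 0.5,
) -> str:
    """Consume a token stream and stop once a whole line is voted as excess.

    `is_excess(prefix, line_so_far)` plays the role of the stop classifier:
    it is queried after every token and returns True if generation should end.
    """
    kept_lines: List[str] = []   # completed lines accepted so far
    current_line = ""            # line currently being generated
    votes: List[bool] = []       # per-token predictions for the current line

    for token in tokens:
        current_line += token
        votes.append(is_excess("".join(kept_lines), current_line))
        if token.endswith("\n"):
            # Line finished: aggregate the per-token votes (assumed rule).
            if sum(votes) / len(votes) > vote_threshold:
                break  # discard the excess line and terminate generation
            kept_lines.append(current_line)
            current_line, votes = "", []
    else:
        # The stream ended on its own; keep any unfinished trailing line.
        kept_lines.append(current_line)

    return "".join(kept_lines)


def toy_is_excess(prefix: str, line_so_far: str) -> bool:
    """Hypothetical stand-in for a learned classifier.

    Once at least one line has been accepted, any non-blank line that starts
    at column 0 is treated as content beyond the requested function.
    """
    if not prefix or not line_so_far.strip():
        return False
    return not line_so_far[0].isspace()


# A fake token stream: a complete function followed by unrequested extra code.
stream = [
    "def add(a, b):", "\n",
    "    return", " a + b", "\n",
    "\n",
    "print(", "add(1, 2))", "\n",
    "# more text that is never consumed",
]
result = generate_with_early_stop(stream, toy_is_excess)
print(result)
```

In this toy run the function body is kept, the trailing `print` line is dropped, and the remaining tokens are never consumed, which is where the saving in inference time would come from. In a real decoding loop the same check would sit inside the model's token-by-token generation rather than over a pre-built list.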

Paper Structure

This paper contains 31 sections, 8 equations, 7 figures, 6 tables, and 1 algorithm.

Figures (7)

  • Figure 1: A motivating example of code generated from CodeLlama-7B.
  • Figure 2: Experimental results of Code LLMs on the MBGP dataset: a comparison between original and expected scenarios.
  • Figure 3: Results of qualitative analysis in the preliminary study.
  • Figure 4: Overview of CodeFast.
  • Figure 5: An example of GenGuard-enhanced Code LLM inference process.
  • ...and 2 more figures