Code Needs Comments: Enhancing Code LLMs with Comment Augmentation

Demin Song, Honglin Guo, Yunhua Zhou, Shuhao Xing, Yudong Wang, Zifan Song, Wenwei Zhang, Qipeng Guo, Hang Yan, Xipeng Qiu, Dahua Lin

TL;DR

This work investigates how NL-code alignment is affected by comment density in pre-training data and proposes a self-supervised data augmentation pipeline that generates and filters code comments to enrich NL-aligned code data. By instruction-tuning a comment generator and applying constrained, line-by-line generation with explicit and implicit filtering, the authors build a self-augmentation loop that improves multiple code-focused LLMs. Experiments on three code-focused LLMs and two widely used programming benchmarks show that higher comment density in training data yields consistent gains and that the augmented data can transfer to other models. The approach demonstrates a scalable pathway for code LLMs to self-improve by leveraging NL-aligned comments, with practical implications for efficiency and data quality in pre-training pipelines.

Abstract

The programming skill is one crucial ability for Large Language Models (LLMs), necessitating a deep understanding of programming languages (PLs) and their correlation with natural languages (NLs). We examine the impact of pre-training data on code-focused LLMs' performance by assessing the comment density as a measure of PL-NL alignment. Given the scarcity of code-comment aligned data in pre-training corpora, we introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language. We conducted experiments on three code-focused LLMs and observed consistent improvements in performance on two widely-used programming skill benchmarks. Notably, the model trained on the augmented data outperformed both the model used for generating comments and the model further trained on the data without augmentation.
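
To make the notion of comment density concrete, the following is a minimal sketch for Python source. It assumes a character-level definition (comment and docstring characters divided by total characters); the abstract does not spell out the exact formula, so this definition, the function name `comment_density`, and the docstring heuristic are illustrative assumptions rather than the authors' implementation.

```python
import io
import tokenize

# Token types after which a bare string literal starts a new statement.
_STATEMENT_START = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}


def comment_density(source: str) -> float:
    """Return the share of characters in `source` that are comments.

    Assumption: density = (comment + docstring characters) / total characters.
    A string literal that begins a statement is treated as a docstring, which
    is a heuristic and can over-count (e.g. a bare string expression).
    The source must be tokenizable Python.
    """
    if not source:
        return 0.0
    comment_chars = 0
    prev_type = tokenize.NEWLINE  # the start of the file begins a statement
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comment_chars += len(tok.string)
        elif tok.type == tokenize.STRING and prev_type in _STATEMENT_START:
            comment_chars += len(tok.string)
        # Blank lines and comments do not change the statement context.
        if tok.type not in (tokenize.NL, tokenize.COMMENT):
            prev_type = tok.type
    return comment_chars / len(source)
```

A score like this could be used to rank or threshold files in a corpus, although the thresholds the authors used are not given in this section.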

Paper Structure (28 sections, 1 equation, 5 figures, 7 tables, 1 algorithm)

Figures (5)

  • Figure 1: Workflow of our proposed self-augmentation method. First, instruction tuning enables the LLM to generate comments for code. The LLM then generates comments for existing code. Finally, the model is further trained on the comment-enriched code data to achieve self-augmentation.
  • Figure 2: If the LLM discovers code with low training value, it will output <|EOT|> to implement an implicit filtering mechanism.
  • Figure 3: Illustration of the constrained generation algorithm. During the generation process, the code is copied directly into the output until the marker indicating the beginning of a comment is encountered (#, ''' or """ for Python). The commented portion is generated by the code comment generator until the end of the comment (\n, ''' or """, correspondingly).
  • Figure 4: HumanEval performance variation with respect to the number of training tokens.
  • Figure 5: Heat map of speedup ratio across different combinations of instance numbers and batch sizes.
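
The captions of Figures 2 and 3 can be sketched together as a simplified, line-by-line procedure. This is an illustration under stated assumptions, not the authors' algorithm: the `propose_comment` callback stands in for the comment generator and is a hypothetical interface, only single-line `#` comments are handled (not ''' or """ blocks), and the real method operates during decoding rather than on whole lines.

```python
from typing import Callable, Optional

EOT = "<|EOT|>"  # rejection marker named in the Figure 2 caption

# Hypothetical generator interface: given the output so far and the next code
# line, return a one-line "# ..." comment, "" for no comment, or EOT to signal
# that the sample has low training value.
CommentFn = Callable[[str, str], str]


def constrained_annotate(code: str, propose_comment: CommentFn) -> Optional[str]:
    """Insert generated comments while copying every code line verbatim.

    Returns None when the generator emits EOT (implicit filtering).
    """
    out = []
    for line in code.splitlines():
        suggestion = propose_comment("\n".join(out), line)
        if suggestion.strip() == EOT:
            return None  # implicit filtering: drop the whole sample
        # A comment ends at the first newline; anything after it is discarded.
        text = suggestion.split("\n", 1)[0].strip()
        if text.startswith("#"):
            indent = line[: len(line) - len(line.lstrip())]
            out.append(indent + text)
        # The code itself is never generated, only copied.
        out.append(line)
    return "\n".join(out)
```

Because code lines are copied rather than generated, removing the inserted comment lines from the output recovers the input exactly, which is the property the constraint is meant to guarantee.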