Code Needs Comments: Enhancing Code LLMs with Comment Augmentation
Demin Song, Honglin Guo, Yunhua Zhou, Shuhao Xing, Yudong Wang, Zifan Song, Wenwei Zhang, Qipeng Guo, Hang Yan, Xipeng Qiu, Dahua Lin
TL;DR
This work investigates how NL-code alignment is affected by comment density in pre-training data and proposes a self-supervised data augmentation pipeline that generates and filters code comments to enrich NL-aligned code data. By instruction-tuning a comment generator and applying constrained, line-by-line generation with explicit and implicit filtering, the authors build a self-augmentation loop that improves multiple code-focused LLMs. Experiments on three code-focused LLMs show that higher comment density in training data yields consistent gains on two widely used programming skill benchmarks and that the augmented data can transfer to other models. The approach demonstrates a scalable pathway for code LLMs to self-improve by leveraging NL-aligned comments, with practical implications for efficiency and data quality in pre-training pipelines.
Abstract
Programming skill is a crucial ability for Large Language Models (LLMs), and it requires a deep understanding of programming languages (PLs) and their correlation with natural languages (NLs). We examine the impact of pre-training data on the performance of code-focused LLMs, using comment density as a measure of PL-NL alignment. Given the scarcity of code-comment aligned data in pre-training corpora, we introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that removes code data poorly correlated with natural language. We conducted experiments on three code-focused LLMs and observed consistent performance improvements on two widely used programming skill benchmarks. Notably, the model trained on the augmented data outperformed both the model used for generating comments and the model further trained on the data without augmentation.
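
To make the notion of comment density concrete, the following is a minimal sketch of one way it could be measured for a Python source file. The abstract does not give the exact definition, so the character-level ratio used here, the function name `comment_density`, and the choice to count only `#` comments (not docstrings) are all assumptions made for illustration.

```python
import io
import tokenize


def comment_density(source: str) -> float:
    """Fraction of non-whitespace characters that belong to '#' comments.

    Assumption: this character-level ratio is an illustrative definition,
    not necessarily the one used by the authors. Docstrings are not counted.
    """
    total_chars = sum(1 for ch in source if not ch.isspace())
    if total_chars == 0:
        return 0.0

    comment_chars = 0
    # The standard-library tokenizer separates real comments from '#'
    # characters that merely appear inside string literals.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comment_chars += sum(1 for ch in tok.string if not ch.isspace())
    return comment_chars / total_chars


if __name__ == "__main__":
    snippet = "x = 1  # set x\ny = 2\n"
    print(f"comment density: {comment_density(snippet):.3f}")
```

Under this assumed definition, a file with no comments scores 0.0 and a file consisting only of comments scores 1.0, so the value could serve as a rough signal when comparing corpora or deciding which files to augment with generated comments.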
