GenCode: A Generic Data Augmentation Framework for Boosting Deep Learning-Based Code Understanding
Zeming Dong, Qiang Hu, Xiaofei Xie, Maxime Cordy, Mike Papadakis, Yves Le Traon, Jianjun Zhao
TL;DR
GenCode introduces a generation-and-selection data augmentation framework for code understanding that first generates diverse augmented training candidates and then selects the most informative samples using a loss-based influence score. By applying semantic/syntax-preserving and syntax-breaking augmentations and selecting top-$K$ samples per epoch, GenCode consistently improves accuracy and robustness over baselines such as MixCode across multiple tasks and pre-trained code models, with smaller gains for code-specific LLMs. The paper also analyzes how the influence score and data selection strategies affect performance, discusses computational costs and data distribution considerations, and outlines limitations and threats to validity. Overall, GenCode is a general, effective augmentation strategy for both classification and non-classification code understanding tasks, offering practical impact for building more reliable code intelligence systems.
Abstract
Pre-trained code models lead the era of code intelligence, and many models with impressive performance have been proposed. However, one important problem remains understudied in this field: data augmentation for code, which automatically helps developers prepare training data. In this paper, we introduce a generic data augmentation framework, GenCode, to enhance the training of code understanding models. In short, GenCode follows a generation-and-selection paradigm to prepare useful training code data. Specifically, it first employs code augmentation techniques to generate new code candidates and then uses influence scores to identify the important ones as training data. To evaluate the effectiveness of GenCode, we conduct experiments on four code understanding tasks (e.g., code clone detection), three pre-trained code models (e.g., CodeT5), and two recently released code-specific Large Language Models (LLMs) (e.g., Qwen2.5-Coder). Compared to the state-of-the-art (SOTA) code augmentation method MixCode, GenCode produces pre-trained code models with 2.92% higher accuracy and 4.90% higher adversarial robustness on average. For code-specific LLMs, GenCode achieves an average improvement of 0.93% in accuracy and 0.98% in natural robustness.
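The generation-and-selection paradigm described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`generate_candidates`, `influence_score`, `select_top_k`), the two toy augmenters, and the stand-in model `toy_predict_proba` are all hypothetical. The sketch also assumes that the influence score is the per-sample cross-entropy loss under the current model and that higher-loss candidates are treated as more informative; the abstract itself does not state the scoring direction.

```python
import math
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, int]  # (code snippet, class label)


def generate_candidates(
    data: Sequence[Sample],
    augmenters: Sequence[Callable[[str], str]],
) -> List[Sample]:
    """Generation step: apply every augmenter to every training sample."""
    return [(aug(code), label) for code, label in data for aug in augmenters]


def influence_score(probs: Sequence[float], label: int) -> float:
    """Assumed influence score: cross-entropy loss on the true label."""
    return -math.log(max(probs[label], 1e-12))


def select_top_k(
    candidates: Sequence[Sample],
    predict_proba: Callable[[str], Sequence[float]],
    k: int,
) -> List[Sample]:
    """Selection step: keep the k candidates with the highest score."""
    scores = [
        influence_score(predict_proba(code), label)
        for code, label in candidates
    ]
    # Sort by descending score; ties are broken by original position.
    order = sorted(range(len(candidates)), key=lambda i: (-scores[i], i))
    return [candidates[i] for i in order[:k]]


# --- Hypothetical toy components, for illustration only ---

def rename_variable(code: str) -> str:
    """Toy stand-in for a semantic-preserving augmentation."""
    return code.replace("x", "tmp")


def drop_last_token(code: str) -> str:
    """Toy stand-in for a syntax-breaking augmentation."""
    return " ".join(code.split()[:-1])


def toy_predict_proba(code: str) -> List[float]:
    """Stand-in for a real model: longer snippets lean toward class 0."""
    p0 = min(0.9, 0.1 + 0.1 * len(code.split()))
    return [p0, 1.0 - p0]


# One selection round; in training this would be repeated every epoch
# with the current model's predictions.
train_data = [("x = 1", 0), ("return x + y", 1)]
candidates = generate_candidates(train_data, [rename_variable, drop_last_token])
selected = select_top_k(candidates, toy_predict_proba, k=2)
```

In an actual training loop, `toy_predict_proba` would be replaced by the model being fine-tuned, so the selected subset changes from epoch to epoch as the model's losses change.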
