GenCode: A Generic Data Augmentation Framework for Boosting Deep Learning-Based Code Understanding

Zeming Dong, Qiang Hu, Xiaofei Xie, Maxime Cordy, Mike Papadakis, Yves Le Traon, Jianjun Zhao

TL;DR

GenCode introduces a generation-and-selection data augmentation framework for code understanding that first generates diverse augmented training candidates and then selects the most informative samples using a loss-based influence score. By applying semantic/syntax-preserving and syntax-breaking augmentations and selecting top-$K$ samples per epoch, GenCode consistently improves accuracy and robustness over baselines such as MixCode across multiple tasks and pre-trained code models, with smaller gains for code-specific LLMs. The paper also analyzes how the influence score and data selection strategies affect performance, discusses computational costs and data distribution considerations, and outlines limitations and threats to validity. Overall, GenCode is a general, effective augmentation strategy for both classification and non-classification code understanding tasks, offering practical impact for building more reliable code intelligence systems.

Abstract

Pre-trained code models lead the era of code intelligence, and many such models achieve impressive performance. However, one important problem, data augmentation for code data that automatically helps developers prepare training data, remains understudied in this field. In this paper, we introduce a generic data augmentation framework, GenCode, to enhance the training of code understanding models. In short, GenCode follows a generation-and-selection paradigm to prepare useful training code data. Specifically, it first employs code augmentation techniques to generate new code candidates and then uses influence scores to identify the important ones as training data. To evaluate the effectiveness of GenCode, we conduct experiments on four code understanding tasks (e.g., code clone detection), three pre-trained code models (e.g., CodeT5), and two recently released code-specific Large Language Models (LLMs) (e.g., Qwen2.5-Coder). Compared to the state-of-the-art (SOTA) code augmentation method MixCode, GenCode produces pre-trained code models with 2.92% higher accuracy and 4.90% higher adversarial robustness on average. For code-specific LLMs, GenCode achieves an average improvement of 0.93% in accuracy and 0.98% in natural robustness.
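
The generation-and-selection paradigm described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the two augmentation operators, the `toy_loss` function, and all names are hypothetical stand-ins, and the sketch assumes that the influence score is the per-sample loss and that the highest-loss candidates are the ones kept.

```python
import math
import re
from typing import Callable, List, Tuple

Sample = Tuple[str, int]  # (code text, label); hypothetical representation


def rename_variable(code: str) -> str:
    """Toy semantic-preserving operator: rename the identifier `x` to `tmp`."""
    return re.sub(r"\bx\b", "tmp", code)


def drop_last_token(code: str) -> str:
    """Toy syntax-breaking operator: delete the final whitespace-separated token."""
    tokens = code.split()
    return " ".join(tokens[:-1]) if len(tokens) > 1 else code


def generate_candidates(
    batch: List[Sample], operators: List[Callable[[str], str]]
) -> List[Sample]:
    """Generation step: apply every operator to every sample, keeping the label."""
    return [(op(code), label) for code, label in batch for op in operators]


def toy_loss(sample: Sample) -> float:
    """Stand-in for a model's per-sample cross-entropy loss (assumption).

    A real setup would run the code model being trained on the sample instead.
    """
    code, label = sample
    p_one = 1.0 / (1.0 + math.exp(-(len(code.split()) - 4)))
    p_correct = p_one if label == 1 else 1.0 - p_one
    return -math.log(max(p_correct, 1e-12))


def select_top_k(
    candidates: List[Sample], score_fn: Callable[[Sample], float], k: int
) -> List[Sample]:
    """Selection step: keep the k candidates with the highest score."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]


# One illustrative "epoch": generate candidates, then keep the top-K by loss.
batch = [("int x = 1 ;", 1), ("return x + y ;", 0)]
candidates = generate_candidates(batch, [rename_variable, drop_last_token])
selected = select_top_k(candidates, toy_loss, k=2)
print(selected)
```

In an actual training loop, the selected samples would be used for the current epoch and the scores recomputed in the next one, since the model's losses change as training proceeds.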

Paper Structure (31 sections, 6 figures, 12 tables, 1 algorithm)

This paper contains 31 sections, 6 figures, 12 tables, 1 algorithm.

Figures (6)

  • Figure 1: Correlation between loss values and code model accuracy.
  • Figure 2: Workflow of GenCode in one training epoch.
  • Figure 3: An example of atomic data augmentation operators.
  • Figure 4: Convergence speed of CodeBERT using different code augmentation methods in each task.
  • Figure 5: Visualization of code embeddings after dimension reduction using Principal Component Analysis (PCA). Model: CodeBERT, dataset: Refactory, task: Bug detection.
  • ...and 1 more figure

Theorems & Definitions (5)

  • Definition 1: Program Code
  • Definition 2: Code Model
  • Definition 3: Code Augmentation
  • Definition 4: Semantic-Preserving (Breaking)
  • Definition 5: Syntax-Preserving (Breaking)