Boosting Source Code Learning with Text-Oriented Data Augmentation: An Empirical Study

Zeming Dong, Qiang Hu, Yuejun Guo, Zhenya Zhang, Maxime Cordy, Mike Papadakis, Yves Le Traon, Jianjun Zhao

TL;DR

This work investigates whether text-oriented data augmentation methods from NLP can improve source code learning. By adapting seven NLP augmentation techniques across three categories (paraphrasing, noising, and sampling) and evaluating 25 methods on four programming tasks with four model architectures (including CodeBERT and GraphCodeBERT), it provides a large-scale empirical assessment. Key findings show that Mixup-based methods such as SenMixup can boost accuracy for non-pretrained models (by up to about 8.7%), while robustness gains are generally modest. Data augmentation becomes more beneficial as training data become scarce, with notable improvements in both accuracy and robustness under low-resource conditions. The study also reveals that some syntax-breaking augmentations can be advantageous and that text-oriented methods sometimes outperform code-refactoring baselines, though results are dataset- and model-dependent. Overall, the work offers practical guidance for selecting augmentation strategies, emphasizes the value of embedding-level transformations, and contributes publicly available datasets and code to support future research in data augmentation for code learning.
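To make the idea of an embedding-level, Mixup-style transformation concrete, the following is a minimal sketch of the standard Mixup interpolation applied to two embedding vectors and their one-hot labels. It is not the paper's implementation: the function names `mixup` and `sample_lambda`, the plain-list vector representation, and the default `alpha` are assumptions made for illustration only.

```python
import random


def sample_lambda(alpha=1.0, rng=random):
    """Draw the mixing coefficient from Beta(alpha, alpha), as in standard Mixup.

    The default alpha is an illustrative assumption, not a value from the paper.
    """
    return rng.betavariate(alpha, alpha)


def mixup(x_i, x_j, y_i, y_j, lam):
    """Linearly interpolate two embedding vectors and their one-hot labels.

    x_i, x_j: embeddings of two programs (lists of floats, same length)
    y_i, y_j: one-hot label vectors (lists of floats, same length)
    lam:      mixing coefficient in [0, 1]
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(x_i) != len(x_j) or len(y_i) != len(y_j):
        raise ValueError("inputs to be mixed must have matching lengths")
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix


# Usage: mix two toy "program embeddings" that belong to different classes.
lam = sample_lambda(alpha=1.0)
x_mix, y_mix = mixup([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], lam)
```

Because the interpolation happens on vectors rather than on source text, the mixed sample does not need to correspond to any syntactically valid program.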

Abstract

Recent studies have demonstrated remarkable advancements in source code learning, which applies deep neural networks (DNNs) to tackle various software engineering tasks. Like other DNN-based domains, source code learning requires massive amounts of high-quality training data to succeed. Data augmentation, a technique used to produce additional training data, is widely adopted in other domains (e.g., computer vision). However, the existing practice of data augmentation in source code learning is limited to simple syntax-preserving methods, such as code refactoring. In this paper, considering that source code can also be represented as text data, we take an early step to investigate the effectiveness of data augmentation methods originally designed for natural language texts in the context of source code learning. To this end, we focus on code classification tasks and conduct a comprehensive empirical study across four critical code problems and four DNN architectures to assess the effectiveness of 25 data augmentation methods. Our results reveal specific data augmentation methods that yield more accurate and robust models for source code learning. Additionally, we discover that the data augmentation methods remain beneficial even when they slightly break source code syntax.
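As an illustration of how a text-oriented augmentation can slightly break source code syntax, the sketch below applies random token swap, a common noising operation in NLP augmentation, to a line of code treated as plain text. The function name `random_swap`, the whitespace tokenization, and the `seed` parameter are assumptions for this example; the paper's own tokenization and method settings may differ.

```python
import random


def random_swap(code, n_swaps=1, seed=None):
    """Swap the positions of randomly chosen token pairs in a code string.

    The code is treated as plain text and split on whitespace, so the
    result is not guaranteed to be syntactically valid.
    """
    rng = random.Random(seed)
    tokens = code.split()
    if len(tokens) < 2:
        return code  # nothing to swap
    for _ in range(n_swaps):
        # Pick two distinct positions and exchange their tokens.
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)


# Usage: the augmented line keeps the same tokens but in a different order.
original = "a = b + c"
augmented = random_swap(original, n_swaps=1, seed=0)
```

The augmented sample keeps the original label, so the model sees the same vocabulary in a perturbed order, which is the sense in which such methods differ from syntax-preserving refactoring.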

Paper Structure (35 sections, 3 equations, 10 figures, 24 tables)

Figures (10)

  • Figure 1: Overview of our empirical study
  • Figure 2: Data augmentation methods
  • Figure 3: Examples of data augmentation methods from NLP to source code learning, with a code snippet from Python800-p00000-s024467653.py (For each sub-figure, the upper part shows the code without data augmentation, and the lower part shows the code after applying data augmentation.)
  • Figure 4: An example of linear interpolation of two programs
  • Figure 5: Example of a successful PL model attack.
  • ...and 5 more figures