Source Code Data Augmentation for Deep Learning: A Survey

Terry Yue Zhuo, Zhou Yang, Zhensu Sun, Yufei Wang, Li Li, Xiaoning Du, Zhenchang Xing, David Lo

TL;DR

This survey addresses the gap in understanding data augmentation for source code by organizing the field into rule-based, model-based, and example-interpolation approaches. It outlines strategies for improving augmentation quality through method stacking and optimization, and highlights practical scenarios and a wide range of downstream tasks where augmentation improves robustness and generalization. The paper contributes a comprehensive taxonomy, a synthesis of techniques, and a roadmap of challenges and opportunities, including the need for theoretical foundations and standardized benchmarks. By detailing both methods and real-world applications, the work aims to guide researchers in selecting effective augmentation strategies and spur further advancement in source-code DA.
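To make the rule-based category concrete, the following is a minimal sketch of one semantics-preserving transformation, identifier renaming, written against Python's standard `ast` module. It is an illustration only and is not taken from the survey: the names `RenameIdentifiers` and `rename_identifiers`, and the renaming map itself, are hypothetical. The sketch assumes Python 3.9 or later (for `ast.unparse`) and that the new names do not clash with existing identifiers.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Hypothetical helper: rename identifiers according to a fixed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Covers variable reads and writes inside the parsed snippet.
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

    def visit_arg(self, node):
        # Covers function parameters; also visit annotations, if any.
        self.generic_visit(node)
        if node.arg in self.mapping:
            node.arg = self.mapping[node.arg]
        return node


def rename_identifiers(source, mapping):
    """Return an augmented copy of `source` with identifiers renamed.

    Assumption: `mapping` introduces no name collisions, so the program's
    behavior is unchanged.
    """
    tree = ast.parse(source)
    tree = RenameIdentifiers(mapping).visit(tree)
    return ast.unparse(tree)


original = "def add(a, b):\n    total = a + b\n    return total\n"
augmented = rename_identifiers(original, {"a": "x", "b": "y", "total": "acc"})
print(augmented)
```

The augmented snippet computes the same function as the original, so it can keep the original label when added to a training set; that label-preserving property is what this sketch is meant to show.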

Abstract

The growing adoption of deep learning models in many critical source code tasks motivates the development of data augmentation (DA) techniques that enhance training data and improve model capabilities such as robustness and generalizability. Although a series of DA methods have been proposed and tailored for source code models, a comprehensive survey examining their effectiveness and implications is still lacking. This paper fills this gap with an integrative survey of data augmentation for source code, in which we systematically compile and summarize the existing literature to provide an overview of the field. We start with an introduction to data augmentation for source code and then discuss the major representative approaches. Next, we highlight general strategies and techniques for optimizing DA quality. Subsequently, we underscore techniques useful in real-world source code scenarios and downstream tasks. Finally, we outline the prevailing challenges and potential opportunities for future research. In essence, we aim to demystify the existing literature on source code DA for deep learning and foster further exploration in this area. Complementing this, we present a continually updated GitHub repository that hosts a list of up-to-date papers on DA for source code modeling, accessible at https://github.com/terryyz/DataAug4Code.

Paper Structure (42 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Yearly publications on the topic of "Source Code DA for Deep Learning". Data statistics as of November 2023.
  • Figure 2: Venue Distribution of the collected publications.
  • Figure 3: Rule-based DA to transform code snippets (Wang2022TestDrivenML).
  • Figure 4: MixCode (dong2023mixcode).
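
As a rough illustration of the example-interpolation category, the sketch below shows the generic Mixup formulation, in which two training examples and their labels are linearly combined with a weight lambda. It is not the MixCode implementation: the function names `mixup` and `sample_lambda` are hypothetical, and the inputs are assumed to be fixed-length feature vectors (for example, code embeddings) with one-hot labels.

```python
import random


def sample_lambda(alpha=1.0, rng=random):
    """Draw the mixing weight from Beta(alpha, alpha), as in standard Mixup."""
    return rng.betavariate(alpha, alpha)


def mixup(x_i, y_i, x_j, y_j, lam):
    """Linearly interpolate two examples and their labels.

    Assumption: x_* are equal-length feature vectors and y_* are one-hot
    (or soft) label vectors over the same classes.
    """
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix


# Toy 2-dimensional "embeddings" with one-hot labels for two classes.
x_mix, y_mix = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.25)
print(x_mix, y_mix)
```

With lambda fixed at 0.25, the mixed example sits three quarters of the way toward the second input, and its soft label reflects the same proportion. In training, lambda would typically be drawn per pair with `sample_lambda`.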