You Augment Me: Exploring ChatGPT-based Data Augmentation for Semantic Code Search

Yanlin Wang; Lianghong Guo; Ensheng Shi; Wenqing Chen; Jiachi Chen; Wanjun Zhong; Menghan Wang; Hui Li; Hongyu Zhang; Ziyu Lyu; Zibin Zheng

TL;DR

This work tackles data scarcity and distribution mismatch in code search by introducing ChatDANCE, a three-stage approach: ChatGPT semantically rewrites queries and code to produce augmented pairs, a cross-encoder filters out low-quality augmented samples, and the backbone model is retrained on the remaining high-quality data. With UniXcoder as the backbone, the method achieves state-of-the-art results on the CoSQA dataset, improving on the best baseline by $13.2\%$ in $R@1$ and $7.0\%$ in MRR, and its performance remains stable across hyperparameter settings. Quantitative metrics ($\ell_{align}$ and $\ell_{uniformity}$) and qualitative analyses (t-SNE visualizations) indicate that ChatDANCE achieves better alignment and uniformity in the code-query representation space, which helps explain its performance. In practice, it offers a scalable, LLM-driven data augmentation paradigm for code search that could potentially extend to other code intelligence tasks.
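For readers unfamiliar with the two retrieval metrics cited above, the following is a minimal sketch of how MRR and R@1 are conventionally computed from the rank of the correct code snippet for each query. The function name and the example ranks are illustrative assumptions, not taken from the paper or its evaluation scripts.

```python
from typing import List, Tuple


def mrr_and_recall_at_1(ranks: List[int]) -> Tuple[float, float]:
    """Compute MRR and R@1 from 1-based ranks of the correct result.

    `ranks[i]` is the position at which the ground-truth code snippet
    appears in the ranked list returned for query i (1 = top result).
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    # Mean reciprocal rank: average of 1/rank over all queries.
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    # R@1: fraction of queries whose correct snippet is ranked first.
    recall_at_1 = sum(1 for r in ranks if r == 1) / len(ranks)
    return mrr, recall_at_1


# Illustrative (made-up) ranks for four queries.
example_ranks = [1, 2, 1, 4]
example_mrr, example_r1 = mrr_and_recall_at_1(example_ranks)
print(f"MRR={example_mrr:.4f}, R@1={example_r1:.2f}")
```

Because R@1 only rewards exact top-1 hits while MRR gives partial credit for near misses, a method can improve the two metrics by different amounts, as in the figures reported here.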

Abstract

Code search plays a crucial role in software development, enabling developers to retrieve and reuse code using natural language queries. While the performance of code search models improves with more high-quality data, obtaining such data can be challenging and expensive. Recently, large language models (LLMs) such as ChatGPT have made remarkable progress in both natural and programming language understanding and generation, offering user-friendly interaction via simple prompts. Inspired by these advancements, we propose a novel approach, ChatDANCE, which utilizes high-quality and diverse augmented data generated by a large language model and leverages a filtering mechanism to eliminate low-quality augmentations. Specifically, we first propose a set of ChatGPT prompting rules designed for source code and queries. Then, we leverage ChatGPT to rewrite code and queries using the corresponding prompts, and propose a filtering mechanism that trains a cross-encoder from the backbone model UniXcoder to filter out code and query pairs with low matching scores. Finally, we re-train the backbone model using the obtained high-quality augmented data. Experimental results show that ChatDANCE achieves state-of-the-art performance, improving the best baseline by 13.2% (R@1) and 7% (MRR). Surprisingly, we find that this augment-filter-retrain strategy enables the backbone model (UniXcoder) to self-grow. Moreover, extensive experiments show the effectiveness of each component and that ChatDANCE performs stably under different hyperparameter settings. In addition, we conduct qualitative and quantitative analyses to investigate why ChatDANCE works well and find that it learns a more uniform distribution of representations and effectively aligns the code and query spaces.
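The filtering step described above can be sketched as follows. This is only an illustration under stated assumptions: `score_fn` is a hypothetical stand-in for the trained cross-encoder, `toy_score` is a deliberately crude token-overlap heuristic used so the example runs without a model, and the threshold value is arbitrary. None of these names or values come from the paper.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (query, code)


def filter_augmented_pairs(
    pairs: Iterable[Pair],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed value, not from the paper
) -> List[Pair]:
    """Keep only augmented (query, code) pairs whose matching score
    reaches the threshold; low-scoring pairs are discarded."""
    return [(q, c) for q, c in pairs if score_fn(q, c) >= threshold]


def toy_score(query: str, code: str) -> float:
    """Hypothetical stand-in for a cross-encoder: the fraction of query
    tokens that also appear among the code tokens."""
    query_tokens = set(query.lower().split())
    normalized = code.lower()
    for ch in "_():":
        normalized = normalized.replace(ch, " ")
    code_tokens = set(normalized.split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & code_tokens) / len(query_tokens)


# Made-up augmented pairs: one plausible match and one mismatch.
augmented = [
    ("remove blank lines", "def remove_blank_lines(path): ..."),
    ("sort a list", "def open_socket(host): ..."),
]
kept = filter_augmented_pairs(augmented, toy_score, threshold=0.5)
print(kept)
```

In the approach described here, the scoring role is played by a cross-encoder trained from UniXcoder, and the retained pairs are then used to re-train the backbone model.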

Paper Structure (31 sections, 8 equations, 7 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Code Search
Large Language Model and In-context Learning
Data Augmentation in Code Search
ChatDANCE Framework
The Data Augmentation Stage
Prompt Schema
Prompt Design
Data Augmentation via ChatGPT
The Data Filtering Stage
Bi-encoder & Cross-encoder
Filtering Algorithm
Model Training
Experimental Design
...and 16 more sections

Figures (7)

Figure 1: An overview of ChatDANCE.
Figure 2: Augmented code samples generated by ChatGPT.
Figure 3: The architectures of bi-encoder and cross-encoder.
Figure 4: The top-1 code returned by QRA, NatGen, and ChatDANCE for the query "how to remove blank lines from a text file in python".
Figure 5: The impact of different hyperparameters.
...and 2 more figures
