CoSQA: 20,000+ Web Queries for Code Search and Question Answering

Junjie Huang, Duyu Tang, Linjun Shou, Ming Gong, Ke Xu, Daxin Jiang, Ming Zhou, Nan Duan

TL;DR

CoSQA fills a gap in code search and code QA by providing a large dataset of real web queries paired with code, with each pair labeled by at least 3 human annotators. It couples a CodeBERT-based Siamese architecture with CoCLR, a contrastive learning framework that uses in-batch negatives and query-rewritten positives to produce richer training signals. Results show consistent improvements over baselines, including on CodeXGLUE WebQueryTest: per the abstract, training on CoSQA improves code question answering accuracy on CodeXGLUE by 5.1% with the same CodeBERT model, and adding CoCLR brings a further 10.5%. The dataset and methods enable more accurate semantic matching between natural language queries and code, and may also benefit related tasks such as code summarization and synthesis.

Abstract

Finding codes given a natural language query is beneficial to the productivity of software developers. Future progress towards better semantic matching between query and code requires richer supervised training resources. To remedy this, we introduce the CoSQA dataset. It includes 20,604 labels for pairs of natural language queries and codes, each annotated by at least 3 human annotators. We further introduce a contrastive learning method dubbed CoCLR to enhance query-code matching, which works as a data augmenter to bring more artificially generated training instances. We show that, evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves the accuracy of code question answering by 5.1%, and incorporating CoCLR brings a further improvement of 10.5%.

Paper Structure

This paper contains 27 sections, 6 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Two examples in CoSQA. A pair of a web query and a Python function with documentation is annotated with "1" or "0", representing whether the code answers the query or not.
  • Figure 2: The frameworks of the siamese network with CodeBERT (left) and our CoCLR method (right). The blue line denotes the original training example. The red lines and dashed lines denote the augmented examples with in-batch augmentation and query-rewritten augmentation, respectively.
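
The two augmentations named in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a training example is a (query, code, label) triple, that any other code in the same batch does not answer a given positive query (so it can serve as a negative), and that a rewritten query keeps its original label. The word-dropping rule and the names in_batch_negatives, rewrite_query, and query_rewritten_positives are hypothetical and chosen only for illustration.

```python
import random
from typing import List, Tuple

# Assumed layout of one training example: (query, code, label),
# where label 1 means the code answers the query and 0 means it does not.
Example = Tuple[str, str, int]


def in_batch_negatives(batch: List[Example]) -> List[Example]:
    """Pair each positive example's query with every other code in the batch.

    Assumption: code from a different example does not answer the query,
    so each such pair is labeled 0.
    """
    augmented: List[Example] = []
    for i, (query, _, label) in enumerate(batch):
        if label != 1:
            continue
        for j, (_, other_code, _) in enumerate(batch):
            if i != j:
                augmented.append((query, other_code, 0))
    return augmented


def rewrite_query(query: str, rng: random.Random) -> str:
    """Hypothetical rewrite rule: drop one randomly chosen word."""
    words = query.split()
    if len(words) < 2:
        return query  # too short to rewrite without emptying the query
    drop = rng.randrange(len(words))
    return " ".join(w for k, w in enumerate(words) if k != drop)


def query_rewritten_positives(batch: List[Example], rng: random.Random) -> List[Example]:
    """Create one rewritten copy of each positive example, keeping label 1."""
    return [(rewrite_query(q, rng), c, y) for q, c, y in batch if y == 1]


if __name__ == "__main__":
    demo_batch = [
        ("python sort list of dicts", "def sort_dicts(items, key): ...", 1),
        ("python read file lines", "def read_lines(path): ...", 1),
    ]
    rng = random.Random(0)  # fixed seed so the rewrites are reproducible
    extra = in_batch_negatives(demo_batch) + query_rewritten_positives(demo_batch, rng)
    for example in demo_batch + extra:
        print(example)
```

In this sketch a batch with n positive examples and m examples overall yields n(m - 1) extra negatives and n extra positives, which reflects the abstract's description of CoCLR as a data augmenter that adds artificially generated training instances.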