ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search

Zehan Li, Jianfei Zhang, Chuantao Yin, Yuanxin Ouyang, Wenge Rong

TL;DR

This work introduces ProCQA, a large-scale, mixed-modal programming QA dataset built from StackOverflow and designed for pre-training and evaluating code-text retrieval systems. It proposes modality-agnostic contrastive pre-training (MACP) to align text and code representations, and reports substantial improvements over CodeSearchNet-based pre-training across diverse benchmarks. ProCQA comprises approximately 5 million QA pairs spanning 11 languages, and applies strict filtering and decontamination to support fair evaluation and robust pre-training. The results indicate that MACP achieves state-of-the-art performance on text-code and code-code retrieval tasks, including zero-shot and cross-domain scenarios, underscoring the practical value of real-world mixed-modal data for code understanding.

Abstract

Retrieval-based code question answering seeks to match user queries in natural language to relevant code snippets. Previous approaches typically rely on pre-training models using crafted bimodal and unimodal datasets to align text and code representations. In this paper, we introduce ProCQA, a large-scale programming question answering dataset extracted from the StackOverflow community, offering naturally structured mixed-modal QA pairs. To validate its effectiveness, we propose a modality-agnostic contrastive pre-training approach to improve the alignment of text and code representations of current code language models. Compared to previous models that primarily employ bimodal and unimodal pairs extracted from CodeSearchNet for pre-training, our model exhibits significant performance improvements across a wide range of code retrieval benchmarks.
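
To make the idea of contrastive alignment concrete, the following is a minimal sketch of an in-batch contrastive (InfoNCE-style) loss over question and answer embeddings. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact objective, so the cosine similarity, the temperature of 0.05, the function names `cosine` and `info_nce_loss`, and the toy two-dimensional vectors are all hypothetical choices. The loss treats each question and answer as an opaque embedding, regardless of whether the underlying text is prose, code, or a mix of both, which is one plausible reading of "modality-agnostic".

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(question_vecs, answer_vecs, temperature=0.05):
    """Mean in-batch contrastive loss (assumed form, not taken from the paper).

    Question i is paired with answer i; every other answer in the batch
    serves as a negative. The temperature value is an assumption.
    """
    total = 0.0
    for i, q in enumerate(question_vecs):
        logits = [cosine(q, a) / temperature for a in answer_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the matching answer.
        total += log_denominator - logits[i]
    return total / len(question_vecs)


# Toy embeddings (hypothetical): in practice these would come from an encoder.
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned_answers = [[0.9, 0.1], [0.1, 0.9]]   # answer i is close to question i
swapped_answers = [[0.1, 0.9], [0.9, 0.1]]   # answer i is close to the other question

aligned_loss = info_nce_loss(questions, aligned_answers)
swapped_loss = info_nce_loss(questions, swapped_answers)
print(aligned_loss < swapped_loss)
```

The loss is small when each question embedding sits closest to its own answer and grows when the pairing is scrambled, which is the signal a contrastive pre-training stage would use to pull matching QA pairs together.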

Paper Structure (30 sections, 2 equations, 3 figures, 13 tables)

This paper contains 30 sections, 2 equations, 3 figures, and 13 tables.

Figures (3)

  • Figure 1: Illustration of different data formats used for contrastive representation alignment. Color represents chunk modality. Unimodal data focuses on code-to-code matching, while bimodal data emphasizes cross-modal matching. The mixed-modal data in ProCQA enables simultaneous learning of all matching patterns.
  • Figure 2: Question and answer length distribution in ProCQA (C subset).
  • Figure 3: Ablation of the pre-training corpus. Results compared on test sets of CodeSearchNet.