CodeRepoQA: A Large-scale Benchmark for Software Engineering Question Answering
Ruida Hu, Chao Peng, Jingyi Ren, Bo Jiang, Xiangxin Meng, Qinyun Wu, Pengfei Gao, Xinchen Wang, Cuiyun Gao
TL;DR
CodeRepoQA addresses the need for realistic, repository-level QA benchmarks in software engineering by introducing a large-scale, multi-turn QA dataset derived from GitHub issues across five programming languages. The dataset is built in two stages: Raw Data Crawling from 30 high-star repositories, followed by Data Filtering to ensure quality. The result is 585,687 entries with an average of 6.62 dialogue turns per entry. The study evaluates ten diverse models, giving each the historical dialogue context as input and treating the maintainer's final reply as ground truth, and scores the responses with BLEU, ROUGE, and Edit Similarity. Key findings show that larger or commercial models do not consistently outperform open-source variants, and that medium-length context prompts yield the best QA performance. These results highlight current limitations of LLMs and the value of realistic, repository-based benchmarks for software engineering QA research.
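The evaluation setup summarized above can be illustrated with a minimal Python sketch. It is not the authors' code. The `Turn` record, its `role` labels, and the `split_issue` helper are hypothetical names introduced here, and the normalization used in `edit_similarity` (one minus Levenshtein distance divided by the longer string's length) is an assumed, commonly used definition rather than one confirmed by this summary.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # hypothetical labels: "user" or "maintainer"
    text: str


def split_issue(turns):
    """Split one issue thread into (context, reference).

    The reference is the last maintainer reply; the context is every
    turn before it. Raises ValueError if no maintainer reply exists.
    """
    for i in range(len(turns) - 1, -1, -1):
        if turns[i].role == "maintainer":
            return turns[:i], turns[i].text
    raise ValueError("issue has no maintainer reply")


def levenshtein(a, b):
    """Character-level edit distance using a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(prediction, reference):
    """Assumed definition: 1 - distance / length of the longer string."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / longest
```

In use, a model would receive the `context` returned by `split_issue` as its prompt, and its answer would be compared with `reference` via `edit_similarity` alongside BLEU and ROUGE scores from an existing metrics library.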
Abstract
In this work, we introduce CodeRepoQA, a large-scale benchmark specifically designed for evaluating repository-level question-answering capabilities in the field of software engineering. CodeRepoQA encompasses five programming languages and covers a wide range of scenarios, enabling comprehensive evaluation of language models. To construct this dataset, we crawl data from 30 well-known repositories on GitHub, the largest platform for hosting and collaborating on code, and carefully filter the raw data. In total, CodeRepoQA is a multi-turn question-answering benchmark with 585,687 entries, covering a diverse array of software engineering scenarios, with an average of 6.62 dialogue turns per entry. We evaluate ten popular large language models on our dataset and provide an in-depth analysis. We find that LLMs still have limited question-answering capabilities in software engineering, and that medium-length contexts are more conducive to LLM performance. The entire benchmark is publicly available at https://github.com/kinesiatricssxilm14/CodeRepoQA.
