Code2Doc: A Quality-First Curated Dataset for Code Documentation

Recep Kaan Karaman, Meftun Akarsu

TL;DR

Code2Doc tackles the problem of noisy, duplicated, and potentially AI-generated function-level documentation in open-source datasets. It introduces a four-stage, quality-first curation pipeline that combines basic filtering, multi-dimensional quality scoring, exact and near-duplicate removal via MinHash and LSH, and heuristic AI-content detection to produce 13,358 high-quality pairs across five languages. The authors show that fine-tuning a Llama 3.1 8B model on Code2Doc yields substantial relative gains in BLEU and ROUGE-L over zero-shot performance, despite the dataset’s modest size. Overall, Code2Doc provides a reproducible dataset and pipeline that prioritize data quality to improve documentation generation, while acknowledging limitations in metrics and coverage and calling for richer evaluations and broader exploration of languages and domains.
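To make the near-duplicate stage concrete, here is a minimal sketch of MinHash signatures with LSH banding, using only the Python standard library. It is an illustration of the general technique, not the authors' pipeline: the tokenizer, shingle size, signature length (`NUM_PERM`), band count (`BANDS`), similarity threshold, and all function names are assumptions chosen for the example.

```python
import hashlib
import re
from collections import defaultdict

NUM_PERM = 64  # signature length (assumed value, not from the paper)
BANDS = 16     # LSH bands; rows per band = NUM_PERM // BANDS (assumed)


def shingles(code, k=3):
    """Split code into tokens and return the set of k-token shingles."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(seed, text):
    """Seeded 64-bit hash; each seed stands in for one random permutation."""
    digest = hashlib.blake2b(
        text.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(code):
    """MinHash signature: the minimum hash over all shingles, per seed."""
    sh = shingles(code)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicate_pairs(snippets, threshold=0.8):
    """Return index pairs whose estimated similarity meets the threshold."""
    sigs = [minhash_signature(s) for s in snippets]
    rows = NUM_PERM // BANDS
    buckets = defaultdict(list)
    # LSH banding: snippets that agree on every row of any band share a bucket.
    for idx, sig in enumerate(sigs):
        for band in range(BANDS):
            key = (band, sig[band * rows:(band + 1) * rows])
            buckets[key].append(idx)
    # Only pairs that collide in at least one bucket are compared directly.
    candidates = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                candidates.add((members[i], members[j]))
    return sorted(
        pair for pair in candidates
        if estimated_jaccard(sigs[pair[0]], sigs[pair[1]]) >= threshold
    )


if __name__ == "__main__":
    corpus = [
        "def add(a, b):\n    return a + b",
        "def add(a, b):\n    return a + b",
        "class Stack:\n    items = []",
    ]
    print(near_duplicate_pairs(corpus))
```

Banding avoids comparing every pair of functions: only snippets that land in the same bucket for at least one band are checked against the threshold, which is what makes this approach practical on tens of thousands of candidates.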

Abstract

The performance of automatic code documentation generation models depends critically on the quality of the training data used for supervision. However, most existing code documentation datasets are constructed through large-scale scraping of public repositories with limited quality control. As a result, they often contain noisy documentation, extensive duplication, and increasing contamination from AI-generated content. These issues weaken the supervision signal available to learning-based models and complicate evaluation. We introduce Code2Doc, a quality-first curated dataset for function-level code documentation generation. Code2Doc consists of 13,358 high-quality function-documentation pairs extracted from widely used open-source repositories spanning five programming languages: Python, Java, TypeScript, JavaScript, and C++. The dataset is constructed using a four-stage curation pipeline that enforces documentation completeness and clarity, filters functions based on structural and complexity criteria, removes exact and near-duplicate code, and identifies documentation likely to be AI-generated. Starting from 52,069 extracted candidates, only 25.6% satisfy all quality constraints. We provide a detailed analysis of the resulting dataset, which achieves a mean documentation quality score of 6.93 out of 10. Overall, 86.9% of samples contain explicit type annotations, and only 2.9% are flagged as potentially AI-generated. Baseline experiments show that fine-tuning a large language model on Code2Doc yields relative improvements of 29.47% in BLEU and 24.04% in ROUGE-L over zero-shot performance, despite the modest dataset size. We release both the dataset and the full curation pipeline to support reproducible research on automatic code documentation generation.
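The two kinds of percentages in the abstract are computed differently, and a short sketch may help keep them apart. The retention rate is a share of the candidate pool, while the BLEU and ROUGE-L figures are relative gains over the zero-shot score. The function names below are illustrative, and the scores passed to `relative_improvement` are placeholder values, not results from the paper, since the abstract reports only the relative gains.

```python
def retention_rate(kept, candidates):
    """Percentage of extracted candidates that survive curation."""
    return 100.0 * kept / candidates


def relative_improvement(zero_shot, fine_tuned):
    """Relative gain of the fine-tuned score over the zero-shot score, in percent."""
    return 100.0 * (fine_tuned - zero_shot) / zero_shot


if __name__ == "__main__":
    # Counts taken from the abstract: 13,358 retained out of 52,069 candidates.
    print(f"retention: {retention_rate(13_358, 52_069):.2f}%")
    # Placeholder scores for illustration only (not reported in the abstract).
    print(f"relative gain: {relative_improvement(10.0, 12.0):.2f}%")
```

A relative gain says nothing about the absolute level of the metric: the same percentage can come from a small absolute change on a low baseline or a large one on a high baseline, so it is best read alongside the underlying scores.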

Paper Structure

This paper contains 51 sections, 6 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Distribution of samples across programming languages in the Code2Doc dataset. Java and Python dominate due to stronger documentation practices in mature enterprise and scientific software ecosystems.
  • Figure 2: Documentation quality scores grouped by programming language. Scores are tightly distributed across languages, indicating consistent quality enforcement despite differences in language ecosystems.
  • Figure 3: Distribution of documentation quality scores in Code2Doc. All retained samples exceed the minimum quality threshold of 6.0, with a compact distribution centered around a mean of 6.93.