GitBugs: Bug Reports for Duplicate Detection, Retrieval Augmented Generation, and More

Avinash Patil

TL;DR

GitBugs addresses limitations in existing bug-report resources by compiling a large-scale, up-to-date corpus of over 150,000 reports from nine open-source projects across GitHub, Bugzilla, and Jira. It standardizes metadata, provides train/test splits for duplicate detection, and includes project-level analytics and reproducible artifacts, enabling tasks such as duplicate detection, bug triaging, resolution time prediction, and retrieval-augmented generation. The paper situates GitBugs against prior datasets, demonstrates its data collection pipeline, and presents a Cassandra case study covering forecasting, classification, time-to-fix prediction, topic modeling, STL decomposition, and a RAG workflow. The dataset aims to facilitate reproducible benchmarking, cross-project research, and practical tooling for software maintenance and automation.

Abstract

Bug reports provide critical insights into software quality, yet existing datasets often suffer from limited scope, outdated content, or insufficient metadata for machine learning. To address these limitations, we present GitBugs, a comprehensive and up-to-date dataset comprising over 150,000 bug reports from nine actively maintained open-source projects, including Firefox, Cassandra, and VS Code. GitBugs aggregates data from the GitHub, Bugzilla, and Jira issue trackers, offering standardized categorical fields for classification tasks and predefined train/test splits for duplicate bug detection. In addition, it includes exploratory analysis notebooks and detailed project-level statistics, such as duplicate rates and resolution times. GitBugs supports various software engineering research tasks, including duplicate detection, retrieval-augmented generation, resolution prediction, automated triaging, and temporal analysis. The openly licensed dataset provides a valuable cross-project resource for benchmarking and advancing automated bug report analysis. Access the data and code at https://github.com/av9ash/gitbugs/.
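
To make the duplicate-detection use case concrete, the following is a minimal sketch of one common baseline: ranking existing reports by TF-IDF cosine similarity to a new report. It is not the method used in the paper, and it does not read the GitBugs files; the report texts and the helper names (`tokenize`, `tfidf_vectors`, `rank_candidates`) are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    n = len(docs)
    # Smoothed IDF so that terms present in every document keep a positive weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: f * idf[t] for t, f in tf.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(query, corpus, k=3):
    """Return the k corpus indices most similar to the query, with their scores."""
    vectors, idf = tfidf_vectors(corpus)
    q_tf = Counter(tokenize(query))
    # Query terms never seen in the corpus are dropped: they have no IDF weight.
    q_vec = {t: f * idf[t] for t, f in q_tf.items() if t in idf}
    scored = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(i, score) for score, i in scored[:k]]


# Hypothetical report summaries standing in for a training split.
known_reports = [
    "Crash when opening settings page on startup",
    "Compaction fails with out of memory error on large tables",
    "Dark theme renders unreadable text in terminal",
]
new_report = "Out of memory error during compaction of large tables"
top_matches = rank_candidates(new_report, known_reports, k=2)
```

In practice the candidate pool would come from a training split and the queries from the corresponding test split, with retrieval quality measured by whether a known duplicate appears among the top-ranked candidates.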

Paper Structure

This paper contains 15 sections, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: Monthly bug report trends from 2020 to 2024 across multiple projects. Mozilla Core consistently reports the highest volume, while other projects show lower and more variable trends over time.
  • Figure 2: Kernel density estimates of bug resolution times across multiple projects. Spark shows the fastest resolution, while SeaMonkey has the longest tail, indicating slower bug fixes.
  • Figure 3: Distribution of bug resolution times across projects using box plots. Most projects exhibit a right-skewed distribution with many outliers; Mozilla Core, SeaMonkey, and Thunderbird show notably longer resolution times.
  • Figure 4: Monthly bug report forecasts using ARIMA and Prophet models. Actual data (blue) is shown alongside ARIMA (orange) and Prophet (green) forecasts for 2024–2025.
  • Figure 5: Confusion matrix for bug severity classification on the Cassandra dataset. Most samples are correctly classified as class 2, with moderate confusion between adjacent classes.
  • ...and 4 more figures