BugsRepo: A Comprehensive Curated Dataset of Bug Reports, Comments and Contributors Information from Bugzilla

Jagrit Acharya, Gouri Ginde

TL;DR

BugsRepo tackles the problem of incomplete bug reports by delivering a holistic Mozilla Bugzilla-derived dataset that combines bug metadata and full comment histories with rich contributor profiles. It introduces three synchronized data components—a comprehensive bug metadata & comments corpus, a contributor information dataset, and a high-quality, structured bug report subset filtered by regex rules and the CTQRS framework. The CTQRS scoring (max 17) and a 75% quality threshold ensure robust, reproducible data for tasks like bug triage, severity prediction, and summarization, while contributor data enables developer matching and collaboration analyses. This integrated resource, spanning 50+ Mozilla projects and totaling around 4.3 GB, aims to advance automated bug report analysis and practical software maintenance workflows, with replication code and data pipelines provided for reproducibility.
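The threshold arithmetic mentioned above can be made concrete with a minimal sketch. It assumes CTQRS scores are integers and that the 75% cutoff is inclusive; the summary states neither, and the function and constant names are hypothetical, chosen only for illustration.

```python
import math

CTQRS_MAX_SCORE = 17      # maximum CTQRS score, as stated in the summary
QUALITY_THRESHOLD = 0.75  # 75% quality threshold, as stated in the summary


def passes_quality_threshold(score: int,
                             max_score: int = CTQRS_MAX_SCORE,
                             threshold: float = QUALITY_THRESHOLD) -> bool:
    """Return True when score / max_score meets the threshold.

    Assumption: the comparison is inclusive (>=); the source does not say.
    """
    if not 0 <= score <= max_score:
        raise ValueError(f"score must be between 0 and {max_score}")
    return score / max_score >= threshold


# Smallest integer score that clears the threshold: ceil(0.75 * 17) = ceil(12.75)
min_passing_score = math.ceil(QUALITY_THRESHOLD * CTQRS_MAX_SCORE)
```

Under these assumptions a report needs at least 13 of the 17 points, since 12/17 is roughly 70.6% and 13/17 is roughly 76.5%.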

Abstract

Bug reports help software development teams enhance software quality, yet their utility is often compromised by unclear or incomplete information. This issue not only hinders developers' ability to quickly understand and resolve bugs but also poses significant challenges for various software maintenance prediction systems, such as bug triaging, severity prediction, and bug report summarization. To address this issue, we introduce BugsRepo, a multifaceted dataset derived from Mozilla projects that offers three key components to support a wide range of software maintenance tasks. First, it includes a Bug report meta-data & Comments dataset with detailed records for 119,585 fixed or closed and resolved bug reports, capturing fields like severity, creation time, status, and resolution to provide rich contextual insights. Second, BugsRepo features a contributor information dataset comprising 19,351 Mozilla community members, enriched with metadata on user roles, activity history, and contribution metrics such as the number of bugs filed, comments made, and patches reviewed, thus offering valuable information for tasks like developer recommendation. Lastly, the dataset provides a structured bug report subset of 10,351 well-structured bug reports, complete with steps to reproduce, actual behavior, and expected behavior. After this initial filter, a secondary filtering layer is applied using the CTQRS scale. By integrating static metadata, contributor statistics, and detailed comment threads, BugsRepo presents a holistic view of each bug's history, supporting advancements in automated bug report analysis, which can enhance the efficiency and effectiveness of software maintenance processes.
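As an illustration of the first filtering step, the sketch below checks whether a report description contains all three sections named in the abstract. The regular expressions and the function name are assumptions made for this example, not the authors' actual rules, and the CTQRS scoring step is not shown.

```python
import re

# Hypothetical section-header patterns; the paper's real regex rules may differ.
SECTION_PATTERNS = {
    "steps_to_reproduce": re.compile(r"steps\s+to\s+reproduce\s*:", re.IGNORECASE),
    "actual": re.compile(r"actual\s+(?:results?|behaviou?r)\s*:", re.IGNORECASE),
    "expected": re.compile(r"expected\s+(?:results?|behaviou?r)\s*:", re.IGNORECASE),
}


def is_structured(description: str) -> bool:
    """Return True only if every required section header appears in the text."""
    return all(p.search(description) for p in SECTION_PATTERNS.values())


# Made-up sample descriptions, used only to exercise the filter.
structured_report = (
    "Steps to reproduce:\n1. Open a new tab\n2. Press Ctrl+W\n\n"
    "Actual results:\nThe browser window closes.\n\n"
    "Expected results:\nOnly the tab closes."
)
unstructured_report = "The browser crashes sometimes when I close tabs."

kept = [r for r in (structured_report, unstructured_report) if is_structured(r)]
```

Requiring all three headers keeps the filter strict: a report that lists steps but omits the expected behavior is dropped before any quality scoring would be applied.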

Paper Structure

This paper contains 9 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Entity-Relationship Diagram depicting a many-to-many relationship among the bug report metadata, bug report comments, and contributor information datasets, illustrating the columns and data types within our analyzed dataset.
  • Figure 2: Overview of the methodology used to develop the various datasets.
  • Figure 3: An example of a high-quality, well-structured bug report. The report contains complete steps to reproduce, expected behavior, actual behavior, and additional information.
  • Figure 4: An example of a low-quality bug report that does not follow the defined Bugzilla bug report template.
  • Figure 5: Distribution of bug reports across projects, showing that Core and Firefox are the projects with the most bug reports filed in the last 5 years.
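The many-to-many relationship described for Figure 1 can be sketched by treating comments as the link between bugs and contributors. All field names (`bug_id`, `user_id`, `severity`, `name`) and the sample records are hypothetical; the actual column names are those shown in the paper's diagram.

```python
from collections import defaultdict

# Made-up records standing in for the three datasets.
bugs = [{"bug_id": 1, "severity": "S2"}, {"bug_id": 2, "severity": "S3"}]
contributors = [{"user_id": 10, "name": "alice"}, {"user_id": 11, "name": "bob"}]
comments = [
    {"bug_id": 1, "user_id": 10, "text": "Can reproduce on nightly."},
    {"bug_id": 1, "user_id": 11, "text": "Patch attached."},
    {"bug_id": 2, "user_id": 10, "text": "Duplicate of another report?"},
]


def link_bugs_and_contributors(comment_rows):
    """Build both directions of the bug <-> contributor mapping from comments."""
    bug_to_users = defaultdict(set)
    user_to_bugs = defaultdict(set)
    for row in comment_rows:
        bug_to_users[row["bug_id"]].add(row["user_id"])
        user_to_bugs[row["user_id"]].add(row["bug_id"])
    return bug_to_users, user_to_bugs


bug_to_users, user_to_bugs = link_bugs_and_contributors(comments)
```

Here one bug has several commenters and one contributor comments on several bugs, which is the many-to-many shape that tasks such as developer recommendation would draw on.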