Automated Duplicate Bug Report Detection in Large Open Bug Repositories

Clare E. Laney, Andrew Barovic, Armin Moin

TL;DR

This work tackles automatic detection of duplicate bug reports in large open-source repositories. It proposes a multi-method framework that combines topic modeling (LDA), information retrieval via cosine similarity, deep learning classifiers, time-based segmentation, clustering, and GPT-based summarization, together with a novel threshold-based duplicate criterion. The approaches are evaluated on the Eclipse Bugzilla BugHub dataset, achieving high accuracy across methods and showing improvements in both binary duplicate detection and the generation of candidate duplicates. The study provides an open-source prototype that helps projects reduce triage effort and improve issue management by accurately identifying and linking duplicate reports. The findings underscore the practical value of combining topic modeling, IR techniques, and modern NLP tools to streamline bug triage at scale.

Abstract

Many users and contributors of large open-source projects report software defects or enhancement requests (known as bug reports) to issue-tracking systems. However, they sometimes report issues that have already been reported, for two main reasons. First, they may not have time to do sufficient research on existing bug reports. Second, they may not possess the right expertise in that specific area to realize that an existing bug report is essentially elaborating on the same matter, perhaps with different wording. In this paper, we propose a novel approach based on machine learning methods that can automatically detect duplicate bug reports in an open bug repository based on the textual data in the reports. We present six alternative methods: topic modeling, Gaussian Naive Bayes, deep learning, time-based organization, clustering, and summarization using a generative pre-trained transformer large language model. Additionally, we introduce a novel threshold-based approach for duplicate identification, in contrast to the conventional top-k selection method that has been widely used in the literature. Our approach demonstrates promising results across all the proposed methods, with accuracy rates ranging from the high 70s to the low 90s (in percent). We evaluated our methods on a public dataset of issues belonging to an Eclipse open-source project.
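The contrast between threshold-based identification and conventional top-k selection can be illustrated with a minimal sketch. This is not the paper's implementation: the bag-of-words cosine similarity, the function names `top_k_candidates` and `threshold_candidates`, the sample reports, and the default threshold of 0.85 (chosen only because an 85% similarity level appears in a figure caption) are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words term-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def top_k_candidates(new_report, existing, k=2):
    """Conventional top-k: always return the k most similar reports,
    even when none of them is actually close to the new report."""
    scored = sorted(((cosine_similarity(new_report, r), r) for r in existing),
                    reverse=True)
    return scored[:k]


def threshold_candidates(new_report, existing, threshold=0.85):
    """Threshold-based: return only reports whose similarity reaches the
    threshold, so the result may be empty for a genuinely new report."""
    scored = ((cosine_similarity(new_report, r), r) for r in existing)
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


# Hypothetical repository contents, for illustration only.
existing = [
    "editor crashes when opening large xml file",
    "toolbar icons missing after update",
    "add dark theme to preferences dialog",
]
new_report = "editor crashes when opening large xml file on startup"

likely_duplicates = threshold_candidates(new_report, existing)
ranked = top_k_candidates(new_report, existing, k=2)
```

With these sample reports, the threshold variant returns only the one near-identical report, while top-k returns two candidates regardless of how dissimilar the second one is. For a report with no close match, the threshold variant returns an empty list, which is the practical difference between the two criteria.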

Paper Structure

This paper contains 27 sections, 6 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Reporting a new issue to the Eclipse Bugzilla repository
  • Figure 2: Duplicate (17.6%) vs. non-duplicate (82.4%) bug reports in our dataset
  • Figure 3: Testing different similarity thresholds
  • Figure 4: 85% similarity results for each topic