
CUPID: Leveraging ChatGPT for More Accurate Duplicate Bug Report Detection

Ting Zhang, Ivana Clairine Irsan, Ferdian Thung, David Lo

TL;DR

The paper tackles duplicate bug report detection in typical bug repositories (around 10k reports), where deep learning methods struggle because there is less data to train on. It introduces CUPID, a hybrid approach that uses ChatGPT to extract essential keywords from bug reports and feeds them into a traditional retrieval model (REP) to rank potential master reports. CUPID achieves state-of-the-art Recall Rate@10 on the Spark, Hadoop, and Kibana datasets (0.602–0.654) and outperforms deep learning baselines by up to 82%. The paper also studies cost, the viability of open-source LLMs (Llama-3, Phi-3, OpenChat), and the impact of prompt design. The work demonstrates the practical value of combining LLM-driven feature extraction with robust traditional IR techniques, and provides extensive ablations and a replication package for practitioners.

Abstract

Duplicate bug report detection (DBRD) is a long-standing challenge in both academia and industry. Over the past decades, researchers have proposed various approaches to detect duplicate bug reports more accurately. With the recent advancement of deep learning, researchers have also proposed several deep learning-based approaches to address the DBRD task. In the bug repositories with many bug reports, deep learning-based approaches have shown promising performance. However, in the bug repositories with a smaller number of bug reports, i.e., around 10k, the existing deep learning approaches show worse performance than the traditional approaches. Traditional approaches have limitations, too, e.g., they are usually based on the bag-of-words model, which cannot capture the semantics of bug reports. To address these aforementioned challenges, we seek to leverage a state-of-the-art large language model (LLM) to improve the performance of the traditional DBRD approach. In this paper, we propose an approach called CUPID, which combines the best-performing traditional DBRD approach (i.e., REP) with the state-of-the-art LLM (i.e., ChatGPT). We conducted an evaluation by comparing CUPID with three existing approaches on three datasets. The experimental results show that CUPID achieves state-of-the-art results, reaching Recall Rate@10 scores ranging from 0.602 to 0.654 across all the datasets analyzed. In particular, CUPID improves over the prior state-of-the-art approach by 5% to 8% in terms of Recall Rate@10 in the datasets. CUPID also surpassed the state-of-the-art deep learning-based DBRD approach by up to 82%.
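
The headline metric, Recall Rate@10, can be illustrated with a short sketch. It assumes the definition commonly used in DBRD work: the fraction of duplicate test reports for which at least one true master report appears among the top k retrieved candidates. The function name, the data layout, and the bug IDs below are illustrative assumptions, not taken from the paper's replication package.

```python
from typing import Dict, List, Set


def recall_rate_at_k(
    rankings: Dict[str, List[str]],
    masters: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Fraction of duplicate queries whose top-k candidates contain a true master.

    rankings: query report ID -> candidate IDs, best match first (assumed layout).
    masters:  query report ID -> IDs accepted as its master report(s).
    """
    if not rankings:
        return 0.0
    hits = sum(
        1
        for query_id, candidates in rankings.items()
        if masters.get(query_id, set()) & set(candidates[:k])
    )
    return hits / len(rankings)


# Toy data with made-up IDs: four duplicate reports, three candidates each.
rankings = {
    "BUG-101": ["BUG-7", "BUG-3", "BUG-9"],
    "BUG-102": ["BUG-4", "BUG-8", "BUG-1"],
    "BUG-103": ["BUG-2", "BUG-5", "BUG-6"],
    "BUG-104": ["BUG-11", "BUG-12", "BUG-10"],
}
masters = {
    "BUG-101": {"BUG-3"},
    "BUG-102": {"BUG-1"},
    "BUG-103": {"BUG-40"},  # master never retrieved
    "BUG-104": {"BUG-11"},
}

for k in (1, 2, 3):
    print(k, recall_rate_at_k(rankings, masters, k))
```

On this toy data the score rises from 0.25 at k=1 to 0.75 at k=3, because a larger cutoff gives each query more chances to surface its master report.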

Paper Structure (24 sections, 3 equations, 3 figures, 10 tables)

Figures (3)

  • Figure 1: CUPID contains three stages: in Stage 1, it applies selection rules to select the test bug reports that need to be processed; in Stage 2, it utilizes ChatGPT to process the selected bug reports; in Stage 3, it leverages REP to retrieve potential master bug reports for each test bug report.
  • Figure 2: Successful prediction Venn diagram
  • Figure 3: The case where CUPID succeeded while REP failed: HADOOP-17091
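
The three stages named in the Figure 1 caption can be sketched as a small pipeline. This is only a rough illustration under stated assumptions: the length-based selection rule is hypothetical (the caption does not state the actual rules), the keyword extractor is a stub standing in for a ChatGPT call, and the token-overlap retriever is a stand-in for REP, not REP itself. All names and sample reports are invented.

```python
from typing import Callable, Dict, List

Report = Dict[str, str]


def needs_llm(report: Report, min_words: int = 8) -> bool:
    """Stage 1 (hypothetical rule): select only reports with long descriptions."""
    return len(report["description"].split()) >= min_words


def overlap_retriever(pool: List[Report]) -> Callable[[Report], List[str]]:
    """Stage 3 stand-in: rank candidates by shared tokens (this is NOT REP)."""

    def retrieve(query: Report) -> List[str]:
        q_tokens = set(query["description"].lower().split())
        ranked = sorted(
            pool,
            key=lambda c: (
                -len(q_tokens & set(c["description"].lower().split())),
                c["id"],  # deterministic tie-break
            ),
        )
        return [c["id"] for c in ranked]

    return retrieve


def run_pipeline(
    test_reports: List[Report],
    extract_keywords: Callable[[str], str],
    retrieve: Callable[[Report], List[str]],
    top_k: int = 10,
) -> Dict[str, List[str]]:
    rankings = {}
    for report in test_reports:
        query = dict(report)  # copy so the original report is left untouched
        if needs_llm(report):
            # Stage 2: replace the description with LLM-extracted keywords.
            query["description"] = extract_keywords(report["description"])
        rankings[report["id"]] = retrieve(query)[:top_k]
    return rankings


# Stub in place of a real LLM call; returns fixed keywords for the demo.
def fake_llm(text: str) -> str:
    return "NullPointerException scheduler restart"


pool = [
    {"id": "M-1", "description": "NullPointerException in scheduler on restart"},
    {"id": "M-2", "description": "dashboard legend overlaps chart"},
]
tests = [
    {"id": "T-1", "description": "After upgrading the cluster last week we saw "
                                 "a NullPointerException when the scheduler does a restart"},
    {"id": "T-2", "description": "legend overlaps chart"},
]

rankings = run_pipeline(tests, fake_llm, overlap_retriever(pool), top_k=1)
print(rankings)
```

Passing the extractor and retriever in as callables keeps the stages swappable, which mirrors the idea of pairing an LLM front end with an unchanged traditional retrieval model.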