Can We Enhance Bug Report Quality Using LLMs?: An Empirical Study of LLM-Based Bug Report Generation

Jagrit Acharya, Gouri Ginde

TL;DR

The paper addresses low-quality bug reports by using instruction-fine-tuned LLMs to convert unstructured reports into a standardized template. It compares open-source models (Qwen 2.5, Mistral, Llama 3.2) against ChatGPT-4o using CTQRS, ROUGE, SBERT, and METEOR. Fine-tuned Qwen 2.5 achieves a CTQRS of 77%, on par with ChatGPT-4o in 3-shot mode (75%), and the approach generalizes across projects, reaching up to 70% CTQRS on unseen projects. The results show strong potential for automatic, privacy-preserving bug-report generation: the models map and flag missing information effectively, especially for Steps-to-Reproduce, while missing Expected Behavior (EB) and Actual Behavior (AB) are harder to detect. The work contributes a public dataset and code, demonstrates cross-project transfer, and highlights practical implications for reducing manual triage effort and accelerating software maintenance.

Abstract

Bug reports contain the information developers need to triage and fix software bugs. However, unclear, incomplete, or ambiguous information may lead to delays and excessive manual effort spent on bug triage and resolution. In this paper, we explore whether instruction fine-tuned Large Language Models (LLMs) can automatically transform casual, unstructured bug reports into high-quality, structured bug reports adhering to a standard template. We evaluate three open-source instruction-tuned LLMs (Qwen 2.5, Mistral, and Llama 3.2) against ChatGPT-4o, measuring performance on established metrics such as CTQRS, ROUGE, METEOR, and SBERT. Our experiments show that fine-tuned Qwen 2.5 achieves a CTQRS score of 77%, outperforming fine-tuned Mistral (71%), Llama 3.2 (63%), and ChatGPT in 3-shot learning (75%). Further analysis reveals that Llama 3.2 shows higher accuracy in detecting missing fields, particularly Expected Behavior and Actual Behavior, while Qwen 2.5 demonstrates superior performance in capturing Steps-to-Reproduce, with an F1 score of 76%. Additional testing of the models on other popular projects (e.g., Eclipse, GCC) demonstrates that our approach generalizes well, achieving up to 70% CTQRS on bug reports from unseen projects. These findings highlight the potential of instruction fine-tuning in automating structured bug report generation, reducing manual effort for developers and streamlining the software maintenance process.
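
The abstract reports a per-field F1 score (e.g., 76% for Steps-to-Reproduce) without spelling out how it is computed. As a minimal sketch, assuming each report is reduced to one boolean per template field saying whether that field was captured, the standard F1 definition could be applied as below. The field names, the `field_f1` helper, and the toy labels are hypothetical illustrations, not the authors' evaluation code or data.

```python
from typing import Dict, List

# Hypothetical field keys mirroring the template fields named in the abstract.
FIELDS = ("steps_to_reproduce", "expected_behavior", "actual_behavior")


def field_f1(gold: List[Dict[str, bool]],
             pred: List[Dict[str, bool]],
             field: str) -> float:
    """Standard F1 for one field over paired gold/predicted boolean labels."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p[field] and g[field]:
            tp += 1          # field correctly reported as captured
        elif p[field] and not g[field]:
            fp += 1          # predicted captured, but gold says it is not
        elif g[field] and not p[field]:
            fn += 1          # gold says captured, but prediction missed it
    if tp == 0:
        return 0.0           # avoids division by zero; F1 is 0 with no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (made up for illustration only).
gold = [{"steps_to_reproduce": True}, {"steps_to_reproduce": True},
        {"steps_to_reproduce": False}, {"steps_to_reproduce": True}]
pred = [{"steps_to_reproduce": True}, {"steps_to_reproduce": False},
        {"steps_to_reproduce": True}, {"steps_to_reproduce": True}]

score = field_f1(gold, pred, "steps_to_reproduce")
print(f"F1 = {score:.3f}")
```

With these toy labels there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and so is the F1.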

Paper Structure

This paper contains 13 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: This is an example of a low-quality bug report, as it does not follow the defined Bugzilla bug report template.
  • Figure 2: This is an example of a bug report generated by a fine-tuned Mistral 7B model, based on the unstructured report.
  • Figure 3: Architecture for Generating High-Quality Bug Reports from Unstructured Bug Reports Using Fine-Tuned Large Language Models.
  • Figure 4: RQ1 and RQ2: Comparing the performance of fine-tuned models with base models and ChatGPT-4o on the test dataset.
  • Figure 5: RQ3 – Heat-map. Upper part (“Missing info”): shows how accurately the model can flag missing fields (higher = better). Bottom part (“Mapping”): shows how well the model maps content from user text to structured report fields (higher = better).
  • ...and 1 more figure