Can We Enhance Bug Report Quality Using LLMs?: An Empirical Study of LLM-Based Bug Report Generation
Jagrit Acharya, Gouri Ginde
TL;DR
The paper tackles the problem of low-quality bug reports by using instruction-fine-tuned LLMs to convert unstructured reports into a standardized template. It compares three open-source models (Qwen 2.5, Mistral, Llama 3.2) against ChatGPT-4o using CTQRS, ROUGE, SBERT, and METEOR. Fine-tuned Qwen 2.5 achieves a CTQRS of 77%, slightly ahead of ChatGPT-4o in 3-shot learning (75%), and the approach generalizes across projects, reaching up to 70% CTQRS on bug reports from unseen projects. The results show strong potential for automatic, privacy-preserving bug-report generation. Missing information is detected most reliably for Steps-to-Reproduce, while missing Expected Behavior (EB) and Actual Behavior (AB) are harder to detect. The work contributes a public dataset and code, demonstrates cross-project transfer, and highlights practical implications for reducing manual triage effort and accelerating software maintenance.
Abstract
Bug reports contain the information developers need to triage and fix software bugs. However, unclear, incomplete, or ambiguous information may lead to delays and excessive manual effort spent on bug triage and resolution. In this paper, we explore whether instruction fine-tuned Large Language Models (LLMs) can automatically transform casual, unstructured bug reports into high-quality, structured bug reports adhering to a standard template. We evaluate three open-source instruction-tuned LLMs (Qwen 2.5, Mistral, and Llama 3.2) against ChatGPT-4o, measuring performance on established metrics such as CTQRS, ROUGE, METEOR, and SBERT. Our experiments show that fine-tuned Qwen 2.5 achieves a CTQRS score of 77%, outperforming fine-tuned Mistral (71%) and Llama 3.2 (63%), as well as ChatGPT-4o in 3-shot learning (75%). Further analysis reveals that Llama 3.2 shows higher accuracy in detecting missing fields, particularly Expected Behavior and Actual Behavior, while Qwen 2.5 demonstrates superior performance in capturing Steps-to-Reproduce, with an F1 score of 76%. Additional testing of the models on other popular projects (e.g., Eclipse, GCC) demonstrates that our approach generalizes well, achieving up to 70% CTQRS on bug reports from unseen projects. These findings highlight the potential of instruction fine-tuning in automating structured bug report generation, reducing manual effort for developers and streamlining the software maintenance process.
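The abstract reports an F1 score for Steps-to-Reproduce but does not spell out how it is computed. The following is a minimal sketch, assuming the standard definition (harmonic mean of precision and recall) applied to per-report binary labels for a single field. The names `FieldScores` and `field_f1` and the sample labels are hypothetical and are not taken from the paper's code or data.

```python
from dataclasses import dataclass


@dataclass
class FieldScores:
    precision: float
    recall: float
    f1: float


def field_f1(gold: list[bool], predicted: list[bool]) -> FieldScores:
    """Score one report field (e.g. Steps-to-Reproduce) across many reports.

    gold[i] is True when report i contains the field according to the
    reference annotation; predicted[i] is True when the generated report
    contains it. Both lists are illustrative assumptions.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    true_pos = sum(g and p for g, p in zip(gold, predicted))
    false_pos = sum((not g) and p for g, p in zip(gold, predicted))
    false_neg = sum(g and (not p) for g, p in zip(gold, predicted))

    # Guard against empty denominators when a class never occurs.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return FieldScores(precision, recall, f1)


if __name__ == "__main__":
    # Made-up labels for five reports, for illustration only.
    gold_labels = [True, True, False, False, True]
    model_labels = [True, False, False, True, True]
    print(field_f1(gold_labels, model_labels))
```

Computing the score per field rather than over the whole report is what allows a comparison such as the one in the abstract, where one model is stronger on Steps-to-Reproduce and another on Expected and Actual Behavior.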
