An Empirical Study on the Classification of Bug Reports with Machine Learning

Renato Andrade, César Teixeira, Nuno Laranjeiro, Marco Vieira

TL;DR

This study addresses misclassified bug reports through a large-scale empirical evaluation of classical ML classifiers on a heterogeneous dataset of $N=661{,}431$ issue reports from $52$ projects, spanning $10$ programming languages and three issue tracking systems (ITSs). Using a TF-IDF Bag-of-Words representation with chi-squared feature selection, the authors compare five classifiers: k-Nearest Neighbors (k-NN), Naive Bayes (NB), Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR). SVM, LR, and RF consistently outperform NB and k-NN, with an average $F_1$ of roughly $0.67$–$0.68$. The authors also analyze the influence of report content (title vs. description), programming language, ITS, and cross-project generalization. They observe that 250 dimensions are sufficient, that language and ITS significantly affect performance, and that cross-project transfer is feasible when language and ITS are held fixed. The paper closes with practical guidelines for future studies, emphasizing heterogeneous data, balanced training, and targeted evaluation metrics, and it points to interpretability and broader feature sources as directions for further work. Overall, the work offers evidence and actionable guidance for building more reliable bug-report classifiers in heterogeneous, real-world settings.
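
The representation and selection steps described above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy titles and labels are invented, `build_classifier` is a hypothetical helper name, and `k=10` is used only because the toy vocabulary is tiny (the study reports that 250 dimensions suffice on its real dataset). Logistic Regression stands in for any of the five compared classifiers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy issue titles, invented for illustration; 1 = bug, 0 = non-bug.
titles = [
    "crash when opening settings page",
    "null pointer exception on startup",
    "app freezes after login error",
    "wrong result returned by parser",
    "add dark mode option",
    "support export to csv",
    "please document the config file",
    "request: add keyboard shortcuts",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def build_classifier(k=10):
    """TF-IDF bag-of-words -> chi-squared top-k terms -> linear classifier.

    k is the number of terms kept by feature selection (an assumption here;
    it must not exceed the vocabulary size of the training data).
    """
    return make_pipeline(
        TfidfVectorizer(lowercase=True),    # bag-of-words with TF-IDF weights
        SelectKBest(chi2, k=k),             # keep the k highest chi-squared terms
        LogisticRegression(max_iter=1000),  # could be swapped for LinearSVC, etc.
    )


clf = build_classifier(k=10)
clf.fit(titles, labels)
print(clf.predict(["crash on startup", "add csv export option"]))
```

Putting all three stages in one pipeline means the vocabulary and the selected terms are learned from the training data only, which avoids leaking test-set information when the pipeline is cross-validated.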

Abstract

Software defects are a major threat to the reliability of computer systems. The literature shows that more than 30% of bug reports submitted in large software projects are misclassified (i.e., they are actually feature requests or mistakes made by the bug reporter), which forces developers to spend considerable effort inspecting them manually. Machine Learning algorithms can be used to classify issue reports automatically. Still, little is known about key aspects of model training, such as the influence of programming languages and issue tracking systems. In this paper, we use a dataset of more than 660,000 issue reports, collected from heterogeneous projects hosted in different issue tracking systems, to study how different factors (e.g., project language, report content) influence the performance of models that classify issue reports. Results show that classification performance does not differ significantly between using the report title and using the description; Support Vector Machine, Logistic Regression, and Random Forest are effective in classifying issue reports; programming languages and issue tracking systems influence classification outcomes; and models trained on heterogeneous projects can classify reports from projects not present during training. Based on these findings, we propose guidelines for future research, including recommendations for using heterogeneous data and selecting high-performing algorithms.
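
One way to test whether a model can classify reports from projects not present during training is a leave-one-project-out split scored with the F-measure. The sketch below uses only the Python standard library. The split protocol, the record layout (`project`, `label` keys), and the function names are assumptions made for illustration; the source does not state that the study used exactly this procedure.

```python
from collections import defaultdict


def f1_score(y_true, y_pred, positive="bug"):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def leave_one_project_out(reports):
    """Yield (held_out, train, test); the held-out project never appears in train."""
    by_project = defaultdict(list)
    for report in reports:
        by_project[report["project"]].append(report)
    for held_out in sorted(by_project):
        train = [r for name, rs in by_project.items() if name != held_out for r in rs]
        yield held_out, train, by_project[held_out]


# Hypothetical records: project names and labels are invented.
reports = [
    {"project": "alpha", "label": "bug"},
    {"project": "alpha", "label": "other"},
    {"project": "beta", "label": "bug"},
    {"project": "gamma", "label": "other"},
]
for held_out, train, test in leave_one_project_out(reports):
    print(held_out, len(train), len(test))
```

Each fold would train a classifier on `train` and compute `f1_score` on `test`, so the reported score reflects performance on a project the model has never seen.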

Paper Structure

This paper contains 16 sections, 4 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Steps of a typical ML pipeline.
  • Figure 2: Mean F-measure by number of dimensions.
  • Figure 3: Performance comparison - titles vs. descriptions.
  • Figure 4: F-score comparison between NB, LR, RF, SVM and KNN algorithms.
  • Figure 5: F-score comparison between programming languages.
  • ...and 3 more figures