An Empirical Study on Bug Severity Estimation using Source Code Metrics and Static Analysis

Ehsan Mashhadi, Shaiful Chowdhury, Somayeh Modaberi, Hadi Hemmati, Gias Uddin

TL;DR

This study investigates whether source-code metrics and static analysis can predict bug severity at the method level, using Defects4J and Bugs.jar datasets. It finds that 10 metrics reliably indicate bug presence but offer limited ability to distinguish severity, while SpotBugs and Infer exhibit poor bug-detection performance and largely unreliable severity labeling. A qualitative analysis reveals that severity correlates with categories like Security and Integration and that many severe bugs occur in comparatively simple methods, underscoring limitations of pattern-based static analysis. The work suggests combining code metrics with static analysis and exploring dynamic analysis or language-model-based approaches to improve severity prediction and prioritization in practice.

Abstract

In the past couple of decades, significant research effort has been devoted to the prediction of software bugs (i.e., defects). In general, these works leverage a diverse set of metrics, tools, and techniques to predict which classes, methods, lines, or commits are buggy. However, most existing work in this domain treats all bugs as equal, which does not hold in practice: the more severe a bug, the greater its consequences. It is therefore important for a defect prediction method to estimate the severity of the identified bugs, so that higher-severity ones get immediate attention. In this paper, we provide a quantitative and qualitative study on two popular datasets (Defects4J and Bugs.jar), using 10 common source code metrics and two popular static analysis tools (SpotBugs and Infer), to analyze their capability to predict defects and their severity. We studied 3,358 buggy methods with different severity labels from 19 Java open-source projects. Results show that although code metrics are useful in predicting buggy code (Lines of Code, Maintainability Index, FanOut, and Effort are the best), they cannot estimate the severity level of the bugs. In addition, we observed that static analysis tools perform weakly in predicting both bugs (F1 score range of 3.1%-7.1%) and their severity label (F1 score under 2%). We also manually studied the characteristics of the severe bugs to identify possible reasons behind the weak performance of code metrics and static analysis tools in estimating their severity. Our categorization further shows that Security bugs have high severity in most cases, while Edge/Boundary faults have low severity. Finally, we discuss the practical implications of the results and propose new directions for future research.
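To make the reported F1 figures concrete, the following is a minimal sketch of how a method-level F1 score could be computed for a static analysis tool. It assumes that a tool "predicts" a method as buggy whenever it raises at least one warning inside it; the abstract does not spell out the matching procedure, so this rule, the function name `method_level_scores`, and the method identifiers are all hypothetical and the data is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    precision: float
    recall: float
    f1: float


def method_level_scores(flagged: set[str], buggy: set[str]) -> Scores:
    """Score a tool's flagged methods against the known buggy methods.

    Assumption: a method counts as predicted-buggy if the tool raised
    at least one warning in it (not necessarily the paper's rule).
    """
    true_positives = len(flagged & buggy)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(buggy) if buggy else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(precision, recall, f1)


# Hypothetical example data, not taken from the study.
buggy = {"A.parse", "B.read", "C.close", "D.write"}
flagged = {"A.parse", "E.init", "F.run", "G.map", "H.get"}

scores = method_level_scores(flagged, buggy)
print(f"precision={scores.precision:.3f} "
      f"recall={scores.recall:.3f} f1={scores.f1:.3f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why a tool that flags many methods but hits few buggy ones ends up with single-digit F1 percentages.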

Paper Structure

This paper contains 33 sections, 4 equations, 24 figures, and 13 tables.

Figures (24)

  • Figure 1: Overview of the setup of our empirical study.
  • Figure 2: Buggy methods severity distributions of Defects4J and Bugs.jar datasets with their USL values.
  • Figure 3: Buggy methods severity distributions in Defects4J dataset with the USL values.
  • Figure 4: Buggy methods severity distributions in Bugs.jar dataset with the USL values.
  • Figure 5: Comparing source code metrics between buggy methods (b axis) and non-buggy methods (nb axis) using aggregated dataset.
  • ...and 19 more figures