Defect Prediction Using Stylistic Metrics

Rafed Muhammad Yasir, Ahmedul Kabir

TL;DR

This work addresses defect prediction by incorporating programming stylistic metrics, a novel signal beyond traditional code and process metrics. It analyzes 60 stylistic features across 14 releases from 5 open-source C++ projects, using four classifiers (Naive Bayes, Support Vector Machine, Decision Tree, and Logistic Regression; NB, SVM, DT, and LR) with SMOTE balancing and VIF-based feature pruning. Buggy files are labeled via bug-fix commits and the SZZ algorithm. Within-project results favor DT, with a mean $F1$ of $78.33\%$; cross-project results are best with DT and SVM, with mean $F1$ scores of $72.07\%$ and $72.57\%$, respectively. In total, 6 of 9 within-project cases and 9 of 14 cross-project cases meet the predefined acceptance thresholds ($Recall>70\%$, $Precision>50\%$). The findings suggest that stylistic metrics provide meaningful, complementary signals for defect proneness at the file level. The work also offers a publicly available dataset for future exploration, with planned expansion to more cross-project configurations and integration with traditional defect predictors.
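The acceptance check mentioned above can be illustrated with a minimal Python sketch. It computes Precision, Recall, and $F1$ for the buggy class from file-level labels and tests them against the stated thresholds ($Recall>70\%$, $Precision>50\%$). The function names, the 0/1 label encoding, and the toy label vectors are assumptions made for illustration; they are not taken from the paper or its dataset.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class (buggy = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # buggy, predicted buggy
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # clean, predicted buggy
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # buggy, predicted clean
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def meets_thresholds(precision, recall, min_recall=0.70, min_precision=0.50):
    """Strict comparison, mirroring Recall > 70% and Precision > 50%."""
    return recall > min_recall and precision > min_precision


# Hypothetical labels for ten files (1 = buggy, 0 = clean).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print("acceptable:", meets_thresholds(precision, recall))
```

In this toy case three of four buggy files are caught with one false alarm, so both thresholds are cleared. A prediction case with recall of exactly 70% would not pass, because the comparison is strict.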

Abstract

Defect prediction is one of the most popular research topics due to its potential to minimize software quality assurance efforts. Existing approaches have examined defect prediction from various perspectives, such as complexity and developer metrics. However, none of these consider programming style. This paper analyzes the impact of stylistic metrics on both within-project and cross-project defect prediction. For prediction, four widely used machine learning algorithms are applied: Naive Bayes, Support Vector Machine, Decision Tree, and Logistic Regression. The experiment is conducted on 14 releases of 5 popular open-source projects. F1, Precision, and Recall are used to evaluate the results. The results reveal that stylistic metrics are a good predictor of defects.

Paper Structure

This paper contains 11 sections, 3 equations, 1 figure, and 5 tables.

Figures (1)

  • Figure 1: Training and Test Data Selection for Within-project and Cross-project Defect Prediction