Defect Prediction Using Stylistic Metrics
Rafed Muhammad Yasir, Ahmedul Kabir
TL;DR
This work addresses defect prediction using programming stylistic metrics, a novel signal beyond traditional code and process metrics. It analyzes 60 stylistic features across 14 releases of 5 open-source C++ projects, using four classifiers: Naive Bayes (NB), Support Vector Machine (SVM), Decision Tree (DT), and Logistic Regression (LR). Classes are balanced with SMOTE, features are pruned using the variance inflation factor (VIF), and buggy files are labeled from bug-fix commits via the SZZ algorithm. Within-project results favor Decision Tree, with a mean F1 of $78.33\%$; cross-project results are best with DT and SVM, with mean F1 scores of $72.07\%$ and $72.57\%$, respectively. Six of 9 within-project cases and 9 of 14 cross-project cases meet the predefined acceptance thresholds ($\text{Recall} > 70\%$, $\text{Precision} > 50\%$). The findings suggest that stylistic metrics provide meaningful, complementary signals for defect proneness at the file level. The dataset is publicly available for future exploration, and planned work includes more cross-project configurations and integration with traditional defect predictors.
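As a minimal sketch of how the acceptance thresholds above could be checked, the following Python snippet computes Precision, Recall, and F1 for the buggy class from confusion counts and tests them against Recall > 70% and Precision > 50%. The function names and the example counts are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for the buggy class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def is_acceptable(precision, recall, min_recall=0.70, min_precision=0.50):
    """Check the acceptance thresholds: Recall > 70% and Precision > 50% (strict)."""
    return recall > min_recall and precision > min_precision


# Hypothetical counts for one release (illustrative only, not from the paper).
p, r, f1 = precision_recall_f1(tp=80, fp=40, fn=20)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f} acceptable={is_acceptable(p, r)}")
```

Both comparisons are strict, so a case with Recall of exactly 70% would not count as acceptable under this reading of the thresholds.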
Abstract
Defect prediction is one of the most popular research topics due to its potential to minimize software quality assurance efforts. Existing approaches have examined defect prediction from various perspectives, such as complexity and developer metrics. However, none of these considers programming style. This paper analyzes the impact of stylistic metrics on both within-project and cross-project defect prediction. For prediction, four widely used machine learning algorithms are applied, namely Naive Bayes, Support Vector Machine, Decision Tree, and Logistic Regression. The experiment is conducted on 14 releases of 5 popular open-source projects. F1, Precision, and Recall are inspected to evaluate the results. Results reveal that stylistic metrics are a good predictor of defects.
