Code Smell Detection via Pearson Correlation and ML Hyperparameter Optimization

Moinuddin Muhammad Imtiaz Bhuiyan, Kazi Ekramul Hoque, Rakibul Islam, Md. Mahbubur Rahman Tusher, Najmul Hassan, Yoichi Tomioka, Satoshi Nishimura, Jungpil Shin, Abu Saleh Musa Miah

TL;DR

The paper tackles accurate and generalizable code smell detection in large software systems, where traditional methods struggle with accuracy and cross-dataset generalization. It presents a comprehensive ML pipeline that balances data with SMOTE, reduces feature redundancy with Pearson correlation, and evaluates eight classifiers while applying Grid, Random, and Bayesian hyperparameter optimization. The strongest results come from AdaBoost, Random Forest, and XGBoost, achieving near-perfect accuracy across several smells and outperforming the state-of-the-art Stack-SVM in multiple categories. The work demonstrates a scalable, optimized approach to software quality assurance and provides a robust framework for detecting both class-level and method-level code smells across diverse datasets.

Abstract

This study addresses the challenge of detecting code smells in large-scale software systems using machine learning (ML). Traditional detection methods often suffer from low accuracy and poor generalization across different datasets. To overcome these issues, we propose a machine learning-based model that automatically and accurately identifies code smells, offering a scalable solution for software quality analysis. The novelty of our approach lies in the use of eight diverse ML algorithms, including XGBoost, AdaBoost, and other classifiers, alongside key techniques such as the Synthetic Minority Over-sampling Technique (SMOTE) for class imbalance and Pearson correlation for efficient feature selection. These methods collectively improve model accuracy and generalization. Our methodology involves several steps: first, we preprocess the data and apply SMOTE to balance the dataset; next, we use Pearson correlation for feature selection to reduce redundancy; then, we train eight ML algorithms and tune their hyperparameters through Grid Search, Random Search, and Bayesian Optimization. Finally, we evaluate the models using accuracy, F-measure, and confusion matrices. The results show that AdaBoost, Random Forest, and XGBoost perform best, achieving accuracies of 100%, 99%, and 99%, respectively. This study provides a robust framework for detecting code smells, enhances software quality assurance, and demonstrates the effectiveness of a comprehensive, optimized ML approach.
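As an illustration of the Pearson-correlation step mentioned in the abstract, the following is a minimal sketch of one common way to filter redundant features: keep a feature only if its absolute correlation with every previously kept feature stays at or below a threshold. The abstract does not specify the authors' exact procedure, so the 0.9 threshold, the greedy keep-first ordering, the handling of constant columns, and the metric names (`loc`, `statements`, `nesting_depth`) are all assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant column is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_redundant_features(columns, threshold=0.9):
    """Greedily keep features whose |r| with all kept features is <= threshold.

    `columns` maps a feature name to its list of values. The threshold
    is a hypothetical choice, not a value taken from the paper.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical code metrics for five classes (illustrative values only).
metrics = {
    "loc": [10, 20, 30, 40, 50],
    "statements": [11, 19, 31, 42, 49],  # nearly a duplicate of loc
    "nesting_depth": [3, 1, 4, 1, 5],
}

selected = drop_redundant_features(metrics, threshold=0.9)
print(selected)
```

In this toy data, `statements` tracks `loc` almost linearly and is dropped, while `nesting_depth` is only weakly correlated with `loc` and is retained. In a full pipeline of the kind the abstract describes, the surviving columns would then be passed to the classifiers and the hyperparameter search.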

Paper Structure

This paper contains 13 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Proposed Model Architecture