Data Preparation for Deep Learning based Code Smell Detection: A Systematic Literature Review

Fengji Zhang, Zexian Zhang, Jacky Wai Keung, Xiangru Tang, Zhen Yang, Xiao Yu, Wenhua Hu

TL;DR

This study addresses the overlooked area of data preparation in DL-based code smell detection (CSD) by conducting a systematic literature review of 36 papers up to December 2023. It dissects data requirements, collection, labeling, and cleaning, identifies seven core challenges, and maps five solution strands (cross-project data, two-phase data usage, resampling, semi-automatic labeling, and data cleaning). The authors offer actionable recommendations to diversify languages and data sources, standardize datasets, enhance transparency, and leverage emerging data techniques and large language models. The findings underscore that high-quality, diverse, and well-governed data are essential for robust, generalizable CSD models with real-world impact.

Abstract

Code Smell Detection (CSD) plays a crucial role in improving software quality and maintainability. Deep Learning (DL) techniques have emerged as a promising approach for CSD due to their superior performance. However, the effectiveness of DL-based CSD methods relies heavily on the quality of the training data. Despite its importance, little attention has been paid to analyzing the data preparation process. This systematic literature review analyzes the data preparation techniques used in DL-based CSD methods. We identify 36 relevant papers published by December 2023 and provide a thorough analysis of the critical considerations in constructing CSD datasets, including data requirements, collection, labeling, and cleaning. We also summarize seven primary challenges and corresponding solutions in the literature. Finally, we offer actionable recommendations for preparing and accessing high-quality CSD data, emphasizing the importance of data diversity, standardization, and accessibility. This survey provides valuable insights for researchers and practitioners to harness the full potential of DL techniques in CSD.

Paper Structure (57 sections, 7 figures, 10 tables)

Figures (7)

  • Figure 1: An example of Feature Envy code smell.
  • Figure 2: The overall process of our systematic literature review.
  • Figure 3: The number of primary studies by year.
  • Figure 4: The critical considerations in CSD data preparation (RQ1). The number of papers for each category is indicated.
  • Figure 5: The frequency of programming languages addressed in analyzed primary studies.
  • ...and 2 more figures