Improving Requirements Classification with SMOTE-Tomek Preprocessing

Barak Or

TL;DR

This paper tackles class imbalance in requirements classification by applying SMOTE-Tomek preprocessing within a stratified K-fold CV framework to the PROMISE dataset of 969 requirements. The methodology combines TF-IDF text representations with a suite of classical ML models, enabling robust evaluation while preserving validation integrity. Logistic Regression emerges as the strongest performer under SMOTE-Tomek, achieving $76.16\%$ accuracy and $0.6736$ MCC, up from a baseline of $58.31\%$ accuracy and $0.4181$ MCC, with additional gains from hyperparameter tuning. The approach demonstrates the practicality of clean, scalable imbalanced-text classification and suggests potential extensions to larger datasets and hybrid methods across related domains.
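Since MCC is reported alongside accuracy, it may help to see how such a score is computed. The sketch below implements the standard multiclass generalization of the Matthews correlation coefficient in plain Python. The function name `multiclass_mcc` is hypothetical, and whether the paper uses exactly this definition is an assumption; it is the form commonly implemented in ML libraries.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient (illustrative sketch)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    s = len(y_true)                                   # total number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly predicted samples
    t_counts = Counter(y_true)                        # how often each class truly occurs
    p_counts = Counter(y_pred)                        # how often each class is predicted
    classes = set(t_counts) | set(p_counts)

    # Covariance-style terms of the multiclass MCC.
    cov_tp = c * s - sum(t_counts[k] * p_counts[k] for k in classes)
    cov_tt = s * s - sum(v * v for v in t_counts.values())
    cov_pp = s * s - sum(v * v for v in p_counts.values())

    denom = sqrt(cov_tt * cov_pp)
    # By convention, return 0.0 when the score is undefined
    # (for example, when every prediction falls in a single class).
    return cov_tp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy labels for illustration only; not taken from the PROMISE dataset.
    truth = ["F", "F", "NF", "NF"]
    print(multiclass_mcc(truth, truth))
```

Unlike accuracy, this score uses the full distribution of true and predicted classes, so a model that predicts only the majority class scores 0 rather than a misleadingly high value. That property is why MCC is often preferred on imbalanced data.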

Abstract

This study addresses class imbalance in requirements classification, a core task in requirements engineering, by applying SMOTE-Tomek preprocessing combined with stratified K-fold cross-validation to the PROMISE dataset. The dataset comprises 969 requirements categorized into functional and non-functional types. The proposed approach improves the representation of minority classes while maintaining the integrity of the validation folds, leading to a notable improvement in classification accuracy. Logistic regression achieved 76.16% accuracy, well above the baseline of 58.31%. These results suggest that classical machine learning models can serve as scalable and interpretable solutions for this task.

Paper Structure (16 sections, 5 equations, 4 figures, 5 tables, 1 algorithm)

Figures (4)

  • Figure 1: Requirement type distribution in the PROMISE dataset.
  • Figure 2: Cross-validation with SMOTE-Tomek applied only to the training folds.
  • Figure 3: Cross-validation performance of machine learning models without SMOTE-Tomek preprocessing.
  • Figure 4: Cross-validation performance of machine learning models with SMOTE-Tomek preprocessing.