Improving Requirements Classification with SMOTE-Tomek Preprocessing
Barak Or
TL;DR
This paper tackles class imbalance in requirements classification by applying SMOTE-Tomek preprocessing within a stratified K-fold cross-validation framework to the PROMISE dataset of 969 requirements. The methodology combines TF-IDF text representations with a suite of classical machine learning models, allowing robust evaluation while preserving the integrity of the validation folds. Logistic regression is the strongest performer under SMOTE-Tomek, reaching 76.16% accuracy and a Matthews correlation coefficient (MCC) of 0.6736, up from a baseline of 58.31% accuracy and an MCC of 0.4181, with additional gains from hyperparameter tuning. The results indicate that classical models with resampling are a practical, scalable option for imbalanced text classification, and they suggest extensions to larger datasets and hybrid methods in related domains.
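Because MCC is reported alongside accuracy, it may help to see why the two can diverge on imbalanced data. The following is a minimal sketch of the standard multiclass MCC computed from a confusion matrix; the function name `mcc` and the toy matrices are illustrative assumptions, not taken from the paper.

```python
import math

def mcc(confusion):
    """Multiclass MCC from a square confusion matrix (rows: true, columns: predicted)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    true_counts = [sum(confusion[k]) for k in range(n)]
    pred_counts = [sum(confusion[r][k] for r in range(n)) for k in range(n)]
    # Covariance-style terms of the generalized (multiclass) MCC.
    cov_tp = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    cov_tt = total ** 2 - sum(t * t for t in true_counts)
    cov_pp = total ** 2 - sum(p * p for p in pred_counts)
    denom = math.sqrt(cov_tt * cov_pp)
    # By convention, return 0 when the denominator vanishes.
    return cov_tp / denom if denom else 0.0

# Hypothetical classifier that always predicts the majority class:
# accuracy is 0.90, but MCC is 0 because the minority class is never found.
always_majority = [[90, 0], [10, 0]]
perfect = [[5, 0], [0, 5]]
mixed = [[3, 1], [2, 4]]
```

In this toy case a model that ignores the minority class still scores 90% accuracy, which is why a correlation-based metric such as MCC is a useful companion on imbalanced datasets.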
Abstract
This study addresses class imbalance in requirements engineering by applying the SMOTE-Tomek preprocessing technique, combined with stratified K-fold cross-validation, to the PROMISE dataset. The dataset comprises 969 requirements categorized into functional and non-functional types. The proposed approach enhances the representation of minority classes while maintaining the integrity of the validation folds, leading to a notable improvement in classification accuracy: logistic regression achieved 76.16% accuracy, compared with a baseline of 58.31%. These results highlight the applicability and efficiency of machine learning models as scalable and interpretable solutions.
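To make the resampling step concrete, the sketch below shows the two stages of SMOTE-Tomek on a small set of 2-D points: SMOTE interpolates new minority samples between same-class neighbours, and Tomek-link removal then drops pairs of opposite-class mutual nearest neighbours. It is a simplified, standard-library illustration for the binary case, not the implementation used in the study. The function names, the toy data, and the choice to remove both members of each link are assumptions. In practice a library implementation such as `SMOTETomek` from imbalanced-learn would typically be applied to TF-IDF vectors.

```python
import math
import random

def neighbours(i, points, candidates):
    """Candidate indices (excluding i) ordered by Euclidean distance to points[i]."""
    return sorted((j for j in candidates if j != i),
                  key=lambda j: math.dist(points[i], points[j]))

def smote(points, labels, minority, k=3, seed=0):
    """Add interpolated minority samples until the two classes are balanced."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    n_needed = len(labels) - 2 * len(minority_idx)  # majority count minus minority count
    new_points, new_labels = list(points), list(labels)
    for _ in range(max(n_needed, 0)):
        i = rng.choice(minority_idx)
        j = rng.choice(neighbours(i, points, minority_idx)[:k])
        gap = rng.random()  # position along the segment from point i to point j
        new_points.append(tuple(a + gap * (b - a) for a, b in zip(points[i], points[j])))
        new_labels.append(minority)
    return new_points, new_labels

def remove_tomek_links(points, labels):
    """Drop both members of every Tomek link (mutual nearest neighbours with different labels)."""
    everyone = range(len(points))
    nn = [neighbours(i, points, everyone)[0] for i in everyone]
    drop = {i for i in everyone if nn[nn[i]] == i and labels[i] != labels[nn[i]]}
    keep = [i for i in everyone if i not in drop]
    return [points[i] for i in keep], [labels[i] for i in keep]

# Toy training fold: 8 functional ("F") and 3 non-functional ("NF") points.
train_x = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.4, 0.0), (0.0, 0.4),
           (0.5, 0.3), (0.3, 0.5), (1.0, 1.0), (1.2, 0.9), (0.9, 1.2)]
train_y = ["F"] * 8 + ["NF"] * 3
x_bal, y_bal = remove_tomek_links(*smote(train_x, train_y, minority="NF"))
```

One common way to keep validation folds intact is to run this resampling on the training portion of each fold only, so that synthetic samples never leak into the data used for scoring. Whether the study implements fold integrity in exactly this way is an assumption here.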
