Improving Requirements Classification with SMOTE-Tomek Preprocessing

Barak Or

TL;DR

This paper tackles class imbalance in requirements classification by applying SMOTE-Tomek preprocessing within a stratified K-fold CV framework to the PROMISE dataset of 969 requirements. The methodology combines TF-IDF text representations with a suite of classical ML models, enabling robust evaluation while preserving validation integrity. Logistic Regression emerges as the strongest performer under SMOTE-Tomek, achieving $76.16\%$ accuracy and $0.6736$ MCC, up from a baseline of $58.31\%$ accuracy and $0.4181$ MCC, with additional gains from hyperparameter tuning. The approach demonstrates the practicality of clean, scalable imbalanced-text classification and suggests potential extensions to larger datasets and hybrid methods across related domains.
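The summary reports MCC alongside accuracy, and a small worked example may help show why both matter under class imbalance. The sketch below uses the standard binary form of the Matthews correlation coefficient; the paper's own formulation is not reproduced on this page (it may well use a multi-class variant), and the confusion counts are hypothetical values chosen only for illustration.

```python
import math


def binary_mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion counts.

    Returns 0.0 when any marginal total is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical 90/10 imbalanced test set (not taken from the paper).
# Classifier A always predicts the majority class.
majority_only = dict(tp=0, tn=90, fp=0, fn=10)
# Classifier B finds most minority items at the cost of some false alarms.
minority_aware = dict(tp=8, tn=80, fp=10, fn=2)

for name, counts in [("A", majority_only), ("B", minority_aware)]:
    print(name, accuracy(**counts), binary_mcc(**counts))
```

Classifier A scores 0.9 accuracy with an MCC of 0.0, since it never identifies a minority item, while classifier B has slightly lower accuracy but a clearly positive MCC. This is the kind of gap that makes MCC a useful companion metric for imbalanced data.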

Abstract

This study addresses class imbalance in requirements engineering by applying SMOTE-Tomek preprocessing, combined with stratified K-fold cross-validation, to the PROMISE dataset. The dataset comprises 969 requirements categorized into functional and non-functional types. The approach strengthens the representation of minority classes while keeping the validation folds intact, which improves classification accuracy. Logistic regression achieved 76.16% accuracy, up from a baseline of 58.31%. These results highlight the applicability and efficiency of machine learning models as scalable and interpretable solutions.

Paper Structure (16 sections, 5 equations, 4 figures, 5 tables, 1 algorithm)

Introduction
Learning Method
Dataset
Pre-Processing
Class Imbalance Challenge
SMOTE-Tomek
Stratified K-fold Cross-Validation Method
Training Algorithm
Classical ML Models
Results and Discussion
Error Metrics
Baseline Performance Without SMOTE-Tomek
Enhanced Performance Using SMOTE-Tomek
Discussion
Conclusions
...and 1 more section

Figures (4)

Figure 1: Requirement type distribution in the PROMISE dataset.
Figure 2: Cross-validation with SMOTE-Tomek carefully applied only to the validation folds.
Figure 3: Cross-validation performance of machine learning models without SMOTE-Tomek preprocessing.
Figure 4: Cross-validation performance of machine learning models with SMOTE-Tomek preprocessing.
