An Efficient Classification Model for Cyber Text
Md Sakhawat Hossen, Md. Zashid Iqbal Borshon, A. S. M. Badrudduza
TL;DR
The paper tackles the environmental impact of deep learning in text analytics by proposing a lightweight classical ML pipeline that combines a modified TF-IDF variant, CTF-IDF, with IRLBA for efficient dimensionality reduction. CTF-IDF weights rare terms more effectively, while IRLBA provides a fast, memory-efficient projection into a compact semantic space, enabling rapid training of SVM and Decision Tree classifiers with competitive performance. On SPAM and SMS Phishing datasets, the approach delivers strong F1-scores and dramatic reductions in training time compared with TF-IDF alone and with transformer baselines, making it a practical, greener alternative to deep learning. The work also compares against BERT, showing that competitive accuracy can be achieved at substantially lower computational cost, and outlines future directions to improve adaptability, scalability, and generalizability across domains.
Abstract
The rise of deep learning in recent years has sharply increased the carbon footprint of machine learning because of its insatiable demand for computational resources and power. Text analytics has been transformed by the same trend, with deep learning coming to dominate the field. In this paper, the original TF-IDF algorithm is modified, and Clement Term Frequency-Inverse Document Frequency (CTF-IDF) is proposed for data preprocessing. The paper primarily examines the effectiveness of classical machine learning techniques in text analytics when combined with CTF-IDF and the faster IRLBA algorithm for dimensionality reduction. Introducing both techniques into the conventional text analytics pipeline yields a faster, less computationally intensive application with a smaller carbon footprint than deep learning methods, at a minor cost in accuracy. The experimental results also show a manifold reduction in time complexity and an improvement in model accuracy for the classical machine learning methods discussed in this paper.
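The abstract does not state the CTF-IDF formula, so it is not reproduced here. As a point of reference only, the following minimal sketch computes the classic TF-IDF baseline that the paper modifies, and shows how a term that occurs in fewer documents receives a larger weight. The function name `tfidf`, the whitespace tokenizer, the unsmoothed `log(N / df)` form of IDF, and the toy messages are all assumptions made for illustration; none of them come from the paper.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document using classic TF-IDF.

    tf  = term count / document length
    idf = log(N / df), where df is the number of documents containing the term
    (This is the textbook baseline, NOT the paper's CTF-IDF.)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical messages, not taken from the paper's datasets).
docs = [
    "win a free prize now",
    "call me now please",
    "free entry win cash now",
]
weights = tfidf(docs)

# Print the first message's terms from highest to lowest weight.
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:>6}: {weight:.4f}")
```

In this toy corpus, "now" appears in every message and so gets weight zero, while "prize", which appears in only one message, outweighs "free", which appears in two. In a full pipeline of the kind the abstract describes, such weight vectors would next be projected to a low-dimensional space (the paper uses IRLBA) before training an SVM or Decision Tree classifier.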
