Machine Learning Driven Smishing Detection Framework for Mobile Security
Diksha Goel, Hussain Ahmad, Ankit Kumar Jain, Nikhil Kumar Goel
TL;DR
Smishing is a growing security risk on smartphones, and the informal language of SMS (abbreviations, slang, short forms) makes it hard to detect. The authors propose a two-phase framework that first normalizes message text with a dictionary-based approach and then detects smishing with a Naive Bayes classifier, which applies Bayes' rule $p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{p(x)}$, where $C_k$ is a message class and $x$ is the message's feature vector. On a merged dataset comprising 4,807 ham and 362 smishing messages (plus 71 Pinterest smishing samples), they report an accuracy of 96.2%, a true positive rate (TPR) of 97.14%, a true negative rate (TNR) of 96.12%, a false positive rate (FPR) of 3.87%, and a false negative rate (FNR) of 2.85%, outperforming baselines. The work demonstrates the practical viability of on-device smishing detection and outlines future directions including richer normalization, larger datasets, and URL-level analysis.
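All five reported figures derive from the four confusion-matrix counts, with smishing as the positive class. The sketch below shows the standard definitions. The function name `rates` and the counts are hypothetical and chosen only for illustration; they are not the paper's confusion matrix.

```python
def rates(tp, fn, tn, fp):
    """Return accuracy, TPR, TNR, FPR and FNR as fractions.

    tp/fn count smishing messages that were caught/missed;
    tn/fp count ham messages that were passed/wrongly flagged.
    """
    positives = tp + fn  # all actual smishing messages
    negatives = tn + fp  # all actual ham messages
    return {
        "accuracy": (tp + tn) / (positives + negatives),
        "tpr": tp / positives,
        "tnr": tn / negatives,
        "fpr": fp / negatives,
        "fnr": fn / positives,
    }


# Hypothetical counts for illustration only; not taken from the paper.
m = rates(tp=90, fn=10, tn=950, fp=50)
for name, value in m.items():
    print(f"{name}: {value:.2%}")
```

By construction TPR + FNR = 1 and TNR + FPR = 1. The reported pairs each sum to 99.99%, which is presumably a rounding artifact.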
Abstract
The increasing reliance on smartphones for communication, financial transactions, and personal data management has made them prime targets for cyberattacks, particularly smishing, a sophisticated variant of phishing conducted via SMS. Despite the growing threat, traditional detection methods often struggle with the informal and evolving nature of SMS language, which includes abbreviations, slang, and short forms. This paper presents an enhanced content-based smishing detection framework that leverages advanced text normalization techniques to improve detection accuracy. By converting nonstandard text into its standardized form, the proposed model enhances the efficacy of machine learning classifiers, particularly the Naive Bayesian classifier, in distinguishing smishing messages from legitimate ones. Our experimental results, validated on a publicly available dataset, demonstrate a detection accuracy of 96.2%, with a low False Positive Rate of 3.87% and False Negative Rate of 2.85%. This approach significantly outperforms existing methodologies, providing a robust solution to the increasingly sophisticated threat of smishing in the mobile environment.
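The two phases described above can be illustrated with a minimal, self-contained sketch: a dictionary lookup that expands short forms, followed by a Naive Bayes classifier over the normalized tokens. It rests on several assumptions. The `NORMALIZATION` dictionary and the toy training messages are invented for illustration, and the multinomial word-count likelihood with add-one smoothing is one common Naive Bayes variant, not necessarily the feature set or model configuration the authors use.

```python
import math
import re
from collections import Counter

# Hypothetical short-form dictionary; the paper's actual dictionary is not reproduced here.
NORMALIZATION = {"u": "you", "ur": "your", "pls": "please", "acct": "account", "msg": "message"}


def normalize(message):
    """Phase 1: lowercase, tokenize, and expand known short forms."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return [NORMALIZATION.get(tok, tok) for tok in tokens]


class NaiveBayes:
    """Phase 2: multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, messages, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)  # messages per class, for p(C_k)
        self.word_counts = {c: Counter() for c in self.classes}
        for msg, label in zip(messages, labels):
            self.word_counts[label].update(normalize(msg))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def log_scores(self, message):
        """Return log p(C_k) + sum of log p(token | C_k) for each class.

        p(x) is the same for every class, so it is dropped.
        """
        total_docs = sum(self.doc_counts.values())
        tokens = normalize(message)
        scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / total_docs)
            for tok in tokens:
                count = self.word_counts[c][tok]  # 0 for unseen tokens
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[c] = score
        return scores

    def predict(self, message):
        scores = self.log_scores(message)
        return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
train = [
    ("ur acct is suspended pls verify at the link", "smishing"),
    ("u won a prize claim ur reward now", "smishing"),
    ("are u coming to dinner tonight", "ham"),
    ("pls call me when u get home", "ham"),
]
model = NaiveBayes().fit([m for m, _ in train], [y for _, y in train])

print(model.predict("verify ur acct to claim the prize"))
print(model.predict("call me when u get to dinner"))
```

Normalization matters here because "ur" and "your", left unmerged, would be counted as unrelated tokens and would split the evidence the classifier sees for the same word.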
