Real-time and Zero-footprint Bag of Synthetic Syllables Algorithm for E-mail Spam Detection Using Subject Line and Short Text Fields
Stanislav Selitskiy
TL;DR
The paper tackles the need for real-time, low-footprint spam filtering on e-mail front lines by introducing BoSS, a Bag of Synthetic Syllables approach that encodes short text (such as subject lines) into a compact 146-dimensional feature space and a 189-dimensional sparse hash, enabling fast proximity checks against known spam. The method uses a synthetic syllabification scheme to produce a low-dimensional representation and compares vectors via cosine and Euclidean distance with simple thresholds, all without persistent storage or external resources. Experiments on one day of live SMTP traffic show that BoSS can operate at near real-time speeds (around 0.02 s per message) with a small RAM footprint, and that it can feed a lightweight perceptron classifier to produce spam verdicts. The author argues that BoSS can offload work from heavier deep-learning approaches by serving as an initial, fast filter, and proposes future work integrating a multi-perceptron layer to better separate bad spam from grey bulk mail.
Abstract
Contemporary e-mail services face high availability expectations from their customers and are resource-strained because of high-volume throughput and spam attacks. Deep machine learning architectures, which are resource-hungry and require off-line processing because of their long processing times, are not acceptable as front-line filters. On the other hand, the bulk of incoming spam is not sophisticated enough to bypass even the simplest algorithms. While the small fraction of intelligent, highly mutable spam can be detected only by deep architectures, the load on them can be reduced by simple near real-time and near zero-footprint algorithms, such as the Bag of Synthetic Syllables algorithm applied to e-mail subject lines and other short text fields. The proposed algorithm creates a sparse hash, or vector, of circa 200 dimensions for each e-mail subject line, which can be compared by cosine or Euclidean distance to find similarities to known spammy subjects. The algorithm does not require any persistent storage, dictionaries, hardware upgrades, or additional software packages. The performance of the algorithm is presented on one day of real SMTP traffic.
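
The proximity check described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the synthetic syllabification scheme, so `toy_encode` below is a hypothetical stand-in that hashes character bigrams into 200 buckets. The names `DIM`, `toy_encode`, `looks_like_known_spam`, and the threshold `min_cosine` are likewise assumptions made for illustration. Only the comparison step, cosine or Euclidean distance between sparse in-memory vectors, follows the description in the abstract.

```python
from __future__ import annotations

import math
import zlib
from collections import Counter

DIM = 200  # assumption: "circa 200" dimensions, as stated in the abstract


def toy_encode(subject: str) -> dict[int, float]:
    """Stand-in encoder: hash character bigrams into DIM buckets.

    NOT the paper's synthetic syllabification; only a placeholder that
    produces a sparse count vector held entirely in memory.
    """
    text = "".join(ch for ch in subject.lower() if ch.isalnum())
    bigrams = (text[i:i + 2] for i in range(len(text) - 1))
    counts = Counter(zlib.crc32(bg.encode("utf-8")) % DIM for bg in bigrams)
    return {bucket: float(n) for bucket, n in counts.items()}


def cosine_similarity(a: dict[int, float], b: dict[int, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a: dict[int, float], b: dict[int, float]) -> float:
    """Euclidean distance of two sparse vectors over the union of their keys."""
    keys = a.keys() | b.keys()
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def looks_like_known_spam(
    subject: str,
    known_spam: list[dict[int, float]],
    min_cosine: float = 0.9,  # hypothetical threshold, not taken from the paper
) -> bool:
    """Flag a subject whose vector is close to any known spam vector."""
    vec = toy_encode(subject)
    return any(cosine_similarity(vec, spam) >= min_cosine for spam in known_spam)


if __name__ == "__main__":
    known = [toy_encode("Cheap pills, best price")]
    print(looks_like_known_spam("CHEAP PILLS -- BEST PRICE!!!", known))
```

Sparse dictionaries keep the per-message memory cost proportional to the number of non-zero buckets rather than to the full dimensionality, which fits the low-footprint goal. A production filter would replace `toy_encode` with the actual syllable-based encoding and tune the threshold on labelled traffic.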
