A Robust Cybersecurity Topic Classification Tool
Elijah Pelofske, Lorie M. Liebrock, Vincent Urias
TL;DR
This work tackles the scalable detection of cybersecurity discussions in noisy online text by using user-defined labels from Reddit, StackExchange, and arXiv to build large labeled datasets. It trains 21 diverse models, evaluates them with both within-source and cross-source validation, and combines their outputs through a majority-vote Cybersecurity Topic Classification (CTC) tool. The results show that the CTC ensemble achieves lower false positive and false negative rates on average than any single model and scales to hundreds of thousands of documents with a wall-clock time on the order of hours, making it practical for large-scale threat intelligence monitoring. The study provides data and code resources to support reproducibility and future improvements in multilingual and cross-domain cyber threat monitoring.
Abstract
In this research, we use user-defined labels from three internet text sources (Reddit, StackExchange, arXiv) to train 21 different machine learning models for the topic classification task of detecting cybersecurity discussions in natural text. We analyze the false positive and false negative rates of each of the 21 models in a cross-validation experiment. We then present a Cybersecurity Topic Classification (CTC) tool, which uses the majority vote of the 21 trained machine learning models as the decision mechanism for detecting cybersecurity-related text. We also show that the majority vote mechanism of the CTC tool provides lower false positive and false negative rates on average than any of the 21 individual models. Finally, we show that the CTC tool scales to hundreds of thousands of documents with a wall-clock time on the order of hours.
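The majority-vote decision and the two error rates described above can be illustrated with a minimal sketch. It is not the CTC tool's actual code: the function names `majority_vote` and `error_rates` are hypothetical, and it assumes each model emits a binary label (1 for cybersecurity-related text, 0 otherwise). With an odd number of voters such as 21, a binary vote cannot tie.

```python
from typing import Sequence, Tuple


def majority_vote(votes: Sequence[int]) -> int:
    """Return 1 if strictly more than half of the binary votes are 1, else 0.

    Assumption: each vote is 0 (not cybersecurity) or 1 (cybersecurity).
    """
    return int(2 * sum(votes) > len(votes))


def error_rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate) for binary labels."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    # Guard against division by zero when a class is absent.
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


# Hypothetical votes from 21 models on a single document: 11 say "cybersecurity".
votes_for_document = [1] * 11 + [0] * 10
decision = majority_vote(votes_for_document)
```

Here `decision` is 1 because 11 of the 21 hypothetical votes are positive. Applying `error_rates` to the ensemble's decisions and to each individual model's predictions on the same labeled documents gives the kind of comparison the abstract reports.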
