A Robust Cybersecurity Topic Classification Tool

Elijah Pelofske, Lorie M. Liebrock, Vincent Urias

TL;DR

This work tackles the scalable detection of cybersecurity discussions in noisy online text by using user-defined labels from Reddit, StackExchange, and Arxiv to create large labeled datasets. It trains 21 different machine learning models, evaluates their false positive and false negative rates with both within-source and cross-source validation, and combines their outputs through a majority-vote Cybersecurity Topic Classification (CTC) tool. The results show that the CTC ensemble achieves lower false negative and false positive rates on average than any single model, and that it scales to hundreds of thousands of documents with a wall-clock time on the order of hours, which makes it practical for large-scale threat intelligence monitoring. The study provides data and code resources to support reproducibility and future improvements in multilingual and cross-domain cyber threat monitoring.

Abstract

In this research, we use user-defined labels from three internet text sources (Reddit, StackExchange, Arxiv) to train 21 different machine learning models for the topic classification task of detecting cybersecurity discussions in natural text. We analyze the false positive and false negative rates of each of the 21 models in a cross-validation experiment. We then present a Cybersecurity Topic Classification (CTC) tool, which takes the majority vote of the 21 trained machine learning models as the decision mechanism for detecting cybersecurity-related text. We also show that the majority vote mechanism of the CTC tool provides lower false negative and false positive rates on average than any of the 21 individual models. Finally, we show that the CTC tool scales to hundreds of thousands of documents with a wall-clock time on the order of hours.
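
The majority-vote decision and the two error rates described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `majority_vote` and `error_rates` are hypothetical, and it assumes each model emits a binary label (1 for cybersecurity, 0 otherwise). With 21 voters, an odd number, a tie cannot occur.

```python
from typing import Sequence, Tuple


def majority_vote(votes: Sequence[int]) -> int:
    """Return 1 if strictly more than half of the binary votes are 1, else 0.

    Hypothetical helper: assumes one 0/1 vote per trained model.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    positives = sum(1 for v in votes if v == 1)
    return 1 if 2 * positives > len(votes) else 0


def error_rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (false_negative_rate, false_positive_rate) for binary labels.

    FNR = FN / (FN + TP), FPR = FP / (FP + TN). A rate is reported as 0.0
    when its denominator is zero (an assumption made for this sketch).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fnr, fpr


if __name__ == "__main__":
    # 11 of 21 hypothetical models flag the document as cybersecurity.
    votes = [1] * 11 + [0] * 10
    print(majority_vote(votes))
```

In this sketch a document is labeled as cybersecurity only when at least 11 of the 21 models agree, so an isolated misfire from one model does not change the ensemble decision.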

Paper Structure

This paper contains 23 sections, 15 figures, 10 tables.

Figures (15)

  • Figure 1: Reddit token data histogram; cybersecurity documents (left) and non-cybersecurity documents (right).
  • Figure 2: StackExchange token data histogram; cybersecurity documents (left) and non-cybersecurity documents (right).
  • Figure 3: Arxiv token data histogram; cybersecurity documents (left) and non-cybersecurity documents (right).
  • Figure 4: False negative rate as a function of token length (left) and false positive rate as a function of maximum token length (right).
  • Figure 5: False negative (left) and false positive (right) rates as a function of maximum token length for the StackExchange labelled text.
  • ...and 10 more figures