
TSTEM: A Cognitive Platform for Collecting Cyber Threat Intelligence in the Wild

Prasasthy Balasubramanian, Sadaf Nazari, Danial Khosh Kholgh, Alireza Mahmoodi, Justin Seby, Panos Kostakos

TL;DR

TSTEM tackles the bottleneck of real-time CTI collection from open sources by combining AI-powered focused crawlers with transformer-based IOC extraction in a containerized, cloud-native architecture. It integrates Kafka-based streaming, Elastic Stack indexing, and IaC/MLOps to deploy and manage CTI pipelines at scale, achieving high accuracy in classification and named entity recognition while maintaining low latency. The work demonstrates robust results, including near-98% sentence-classification accuracy for tweets, ~95% page-classification accuracy, and ~98.7% NER accuracy, with substantial IOC verification across VirusTotal and AlienVault. This platform advances transparent, near real-time CTI sharing and provides a practical path toward automated, end-to-end CTI infrastructure deployment, with future work aimed at broader IOC types, STIX generation, and pipeline automation enhancements.

Abstract

The extraction of cyber threat intelligence (CTI) from open sources is a rapidly expanding defensive strategy that enhances the resilience of both Information Technology (IT) and Operational Technology (OT) environments against large-scale cyber-attacks. While previous research has focused on improving individual components of the extraction process, the community lacks open-source platforms for deploying streaming CTI data pipelines in the wild. To address this gap, the study describes the implementation of an efficient and well-performing platform capable of processing compute-intensive data pipelines based on the cloud computing paradigm for detecting, collecting, and sharing CTI from different online sources in real time. We developed a prototype platform (TSTEM), a containerized microservice architecture that uses Tweepy, Scrapy, Terraform, ELK, Kafka, and MLOps to autonomously search, extract, and index indicators of compromise (IOCs) in the wild. Moreover, the provisioning, monitoring, and management of the TSTEM platform are achieved through infrastructure as code (IaC). Custom focused crawlers collect web content, which is then processed by a first-level classifier to identify potential IOCs. If deemed relevant, the content advances to a second level of extraction for further examination. Throughout this process, state-of-the-art NLP models are used for classification and entity extraction, enhancing the overall IOC extraction methodology. Our experimental results indicate that these models exhibit high accuracy (exceeding 98%) in the classification and extraction tasks, achieving this performance in less than a minute. The effectiveness of our system can be attributed to a finely tuned IOC extraction method that operates at multiple stages, ensuring precise identification of relevant information with low false positives.
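
The two-level flow described in the abstract (a first-level relevance classifier followed by a second-level extraction step) can be illustrated with a minimal sketch. This is not the TSTEM implementation: the paper uses transformer-based models for both stages, whereas the sketch below substitutes a keyword filter and regular expressions purely to stay self-contained and runnable. The names `SECURITY_KEYWORDS`, `IOC_PATTERNS`, `is_relevant`, `extract_iocs`, and `process` are hypothetical.

```python
import re

# Assumption: a keyword set stands in for the first-level classifier.
SECURITY_KEYWORDS = {"malware", "phishing", "ransomware", "c2", "botnet", "exploit"}

# Assumption: regexes stand in for the second-level NER model.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IOC_PATTERNS = {
    "ipv4": re.compile(rf"\b(?:{_OCTET}\.){{3}}{_OCTET}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
}


def is_relevant(text: str) -> bool:
    """Stage 1: cheap relevance filter applied to every crawled document."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return bool(tokens & SECURITY_KEYWORDS)


def extract_iocs(text: str) -> dict:
    """Stage 2: collect candidate IOCs, grouped by type, from relevant text."""
    found = {}
    for kind, pattern in IOC_PATTERNS.items():
        matches = sorted(set(pattern.findall(text)))
        if matches:
            found[kind] = matches
    return found


def process(documents):
    """Run stage 2 only on documents that pass stage 1."""
    for doc in documents:
        if not is_relevant(doc):
            continue
        iocs = extract_iocs(doc)
        if iocs:
            yield {"text": doc, "iocs": iocs}


if __name__ == "__main__":
    sample = [
        "New phishing campaign uses C2 at 203.0.113.7",
        "Great weather in Oulu today 10.0.0.1",
        "Ransomware sample hash d41d8cd98f00b204e9800998ecf8427e",
    ]
    for record in process(sample):
        print(record["iocs"])
```

Running the cheap filter first means the more expensive extraction step only sees content already judged relevant, which is the design rationale the abstract gives for the multi-stage approach and its low false-positive rate.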

Paper Structure (44 sections, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Dataflow used to train the AI model to detect IOC from streaming tweets in real-time.
  • Figure 2: Dataflow used to train the AI model to detect IOC from streaming web pages in real-time.
  • Figure 3: Dataflow of shared components and their interactions.
  • Figure 4: Sequence diagram for Twitter crawler.
  • Figure 5: Sequence diagram for Web crawler.
  • ...and 2 more figures