Bandit on the Hunt: Dynamic Crawling for Cyber Threat Intelligence
Philipp Kuehn, Dilara Nadermahmoodi, Markus Bayer, Christian Reuter
TL;DR
The paper tackles the challenge of acquiring Cyber Threat Intelligence (CTI) from the vast, unstructured web by automating source discovery beyond known sites. It introduces ThreatCrawl, a one-step focused crawler that uses SBERT embeddings to assess relevance and a multi-armed bandit (MAB) with UCB1 to dynamically select among forward, backward, and keyword search actions. Empirical results show harvest rates exceeding 25% and seed expansion of more than 300% starting from as few as 17 seeds, along with numerous newly identified CTI-relevant domains and pages. Limitations include dependence on SBERT and runtime constraints; proposed future work covers dynamic thresholds, larger models, and graph-based analyses to improve discovery and robustness.
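To illustrate how UCB1 can choose among the forward, backward, and keyword search actions, here is a minimal sketch of the textbook UCB1 rule. It is not ThreatCrawl's implementation: the class name `UCB1`, the action labels, the binary reward (1 if a fetched page is judged relevant, else 0), and the hit rates in `simulate` are all assumptions made for illustration.

```python
import math
import random

# Labels for the three crawl actions named in the summary (names are illustrative).
ACTIONS = ["forward", "backward", "keyword_search"]


class UCB1:
    """Textbook UCB1: pick the arm with the highest mean reward plus exploration bonus."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}   # pulls per arm
        self.means = {a: 0.0 for a in self.arms}  # running mean reward per arm

    def select(self):
        # Play every arm once before applying the UCB1 index.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        total = sum(self.counts.values())

        def index(arm):
            bonus = math.sqrt(2.0 * math.log(total) / self.counts[arm])
            return self.means[arm] + bonus

        return max(self.arms, key=index)

    def update(self, arm, reward):
        # Incremental update of the running mean for the chosen arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def simulate(steps=2000, seed=0):
    """Toy crawl loop; the hit rates below are invented, not taken from the paper."""
    rng = random.Random(seed)
    hit_rate = {"forward": 0.30, "backward": 0.15, "keyword_search": 0.05}
    bandit = UCB1(ACTIONS)
    for _ in range(steps):
        arm = bandit.select()
        # Assumed reward: 1.0 if the fetched page counts as relevant, else 0.0.
        reward = 1.0 if rng.random() < hit_rate[arm] else 0.0
        bandit.update(arm, reward)
    return bandit


if __name__ == "__main__":
    result = simulate()
    for arm in ACTIONS:
        print(arm, result.counts[arm], round(result.means[arm], 3))
```

In a real crawler the reward would come from the relevance check on each fetched page rather than from a random draw. The exploration bonus shrinks as an action is tried more often, so actions that keep yielding relevant pages are used more while the others are still sampled occasionally.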
Abstract
Public information contains valuable Cyber Threat Intelligence (CTI) that is used to prevent future attacks. While standards exist for sharing this information, much of it appears in non-standardized news articles or blogs. Monitoring online sources for threats is time-consuming, and source selection is uncertain. Current research focuses on extracting Indicators of Compromise from known sources and rarely addresses the identification of new sources. This paper proposes a CTI-focused crawler that uses a multi-armed bandit (MAB) and various crawling strategies. It employs SBERT to identify relevant documents while dynamically adapting its crawling path. Our system, ThreatCrawl, achieves a harvest rate exceeding 25% and expands its seed by over 300% while maintaining topical focus. Additionally, the crawler identifies previously unknown but highly relevant overview pages, datasets, and domains.
