The Dark Side of the Web: Towards Understanding Various Data Sources in Cyber Threat Intelligence
Saskia Laura Schröer, Noé Canevascini, Irdin Pekaric, Philine Widmer, Pavel Laskov
TL;DR
The paper addresses the problem that CTI extraction underutilizes first-hand cybercriminal data due to access challenges and heterogeneity across sources. It proposes a large-scale, cross-source analysis of underground forums, encrypted chat channels, and darknet websites using a dual-dictionary CTI filter and BERTopic-based topic modeling, with an open-source NLP pipeline for reproducibility. The study finds that about 20% of the data is CTI-relevant, with darknet content more technical while forums and chats offer broader strategic discussions; it identifies 83 topics across sources and highlights platform-specific patterns such as Carding and Data Leaks dominating darknet content. The work advances CTI research by providing practical guidance on data-source selection, offers a reusable toolchain for researchers and practitioners, and lays groundwork for more nuanced taxonomy and threat modeling across diverse cybercriminal ecosystems.
Abstract
Cyber threats have become increasingly prevalent and sophisticated. Prior work has extracted actionable cyber threat intelligence (CTI), such as indicators of compromise, tactics, techniques, and procedures (TTPs), or threat feeds from various sources: open source data (e.g., social networks), internal intelligence (e.g., log data), and "first-hand" communications from cybercriminals (e.g., underground forums, chats, darknet websites). However, "first-hand" data sources remain underutilized because it is difficult to access or scrape their data. In this work, we analyze (i) 6.6 million posts, (ii) 3.4 million messages, and (iii) 120,000 darknet websites. We combine NLP tools to address several challenges in analyzing such data. First, even on dedicated platforms, only some content is CTI-relevant, requiring effective filtering. Second, "first-hand" data can be CTI-relevant from a technical or strategic viewpoint. We demonstrate how to organize content along this distinction. Third, we describe the topics discussed and how "first-hand" data sources differ from each other. According to our filtering, 20% of our sample is CTI-relevant. Most of the CTI-relevant data focuses on strategic rather than technical discussions. Credit card-related crime is the most prevalent topic on darknet websites. On underground forums and chat channels, account and subscription selling is discussed most. Topic diversity is higher on underground forums and chat channels than on darknet websites. Our analyses suggest that different platforms may be used for activities with varying complexity and risks for criminals.
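The filtering step and the technical versus strategic split described above can be illustrated with a minimal sketch. It assumes a dictionary-based filter with two term lists, in the spirit of the dual-dictionary filter mentioned in the TL;DR. The term lists, the `classify` and `relevant_share` helpers, and the tie-breaking rule are all hypothetical and chosen for illustration only; they are not the paper's actual dictionaries or decision rules.

```python
import re

# Hypothetical term lists (assumption): the paper's real dictionaries are not shown here.
TECHNICAL_TERMS = {"exploit", "payload", "cve", "malware", "botnet"}
STRATEGIC_TERMS = {"selling", "carding", "accounts", "leak", "subscription"}

TOKEN_RE = re.compile(r"[a-z0-9]+")


def classify(text):
    """Label a document as 'technical', 'strategic', or 'irrelevant'.

    A document is treated as CTI-relevant if it contains at least one term
    from either dictionary. Ties go to 'strategic' (an arbitrary assumption).
    """
    tokens = TOKEN_RE.findall(text.lower())
    technical_hits = sum(1 for t in tokens if t in TECHNICAL_TERMS)
    strategic_hits = sum(1 for t in tokens if t in STRATEGIC_TERMS)
    if technical_hits == 0 and strategic_hits == 0:
        return "irrelevant"
    return "technical" if technical_hits > strategic_hits else "strategic"


def relevant_share(docs):
    """Fraction of documents that match at least one dictionary term."""
    if not docs:
        return 0.0
    relevant = sum(1 for d in docs if classify(d) != "irrelevant")
    return relevant / len(docs)


# Made-up example documents, not taken from the paper's data.
docs = [
    "New exploit payload for CVE-2021-44228 released",
    "Selling Netflix accounts, cheap subscription",
    "Anyone watching the game tonight?",
]
labels = [classify(d) for d in docs]
```

A filter of this kind is cheap to run over millions of documents, which is why it is a plausible first stage before topic modeling; its quality depends entirely on how well the dictionaries are curated.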
