The Dark Side of the Web: Towards Understanding Various Data Sources in Cyber Threat Intelligence

Saskia Laura Schröer, Noé Canevascini, Irdin Pekaric, Philine Widmer, Pavel Laskov

TL;DR

The paper addresses the problem that CTI extraction underutilizes first-hand cybercriminal data due to access challenges and heterogeneity across sources. It proposes a large-scale, cross-source analysis of underground forums, encrypted chat channels, and darknet websites using a dual-dictionary CTI filter and BERTopic-based topic modeling, with an open-source NLP pipeline for reproducibility. The study finds that about 20% of the data is CTI-relevant, with darknet content more technical while forums and chats offer broader strategic discussions; it identifies 83 topics across sources and highlights platform-specific patterns such as Carding and Data Leaks dominating darknet content. The work advances CTI research by providing practical guidance on data-source selection, offers a reusable toolchain for researchers and practitioners, and lays groundwork for more nuanced taxonomy and threat modeling across diverse cybercriminal ecosystems.

Abstract

Cyber threats have become increasingly prevalent and sophisticated. Prior work has extracted actionable cyber threat intelligence (CTI), such as indicators of compromise, tactics, techniques, and procedures (TTPs), or threat feeds from various sources: open source data (e.g., social networks), internal intelligence (e.g., log data), and "first-hand" communications from cybercriminals (e.g., underground forums, chats, darknet websites). However, "first-hand" data sources remain underutilized because it is difficult to access or scrape their data. In this work, we analyze (i) 6.6 million posts, (ii) 3.4 million messages, and (iii) 120,000 darknet websites. We combine NLP tools to address several challenges in analyzing such data. First, even on dedicated platforms, only some content is CTI-relevant, requiring effective filtering. Second, "first-hand" data can be CTI-relevant from a technical or strategic viewpoint. We demonstrate how to organize content along this distinction. Third, we describe the topics discussed and how "first-hand" data sources differ from each other. According to our filtering, 20% of our sample is CTI-relevant. Most of the CTI-relevant data focuses on strategic rather than technical discussions. Credit card-related crime is the most prevalent topic on darknet websites. On underground forums and chat channels, account and subscription selling is discussed most. Topic diversity is higher on underground forums and chat channels than on darknet websites. Our analyses suggest that different platforms may be used for activities with varying complexity and risks for criminals.
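
The filtering step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the two dictionaries correspond to the technical and strategic viewpoints, which this summary does not confirm. The keyword lists and the names `TECHNICAL_TERMS`, `STRATEGIC_TERMS`, `tokenize`, and `classify` are hypothetical placeholders, not the dictionaries or functions used in the paper.

```python
import re

# Hypothetical keyword lists for illustration only; the paper's actual
# dictionaries are not reproduced in this summary.
TECHNICAL_TERMS = {"exploit", "malware", "payload", "botnet"}
STRATEGIC_TERMS = {"carding", "leak", "accounts", "subscription"}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify(text):
    """Label a data item by which (assumed) dictionary it matches."""
    tokens = tokenize(text)
    technical = bool(tokens & TECHNICAL_TERMS)
    strategic = bool(tokens & STRATEGIC_TERMS)
    if technical and strategic:
        return "both"
    if technical:
        return "technical"
    if strategic:
        return "strategic"
    return "not_relevant"
```

Under these assumptions, an item that matches neither list is treated as not CTI-relevant and would be dropped before topic modeling.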

Paper Structure

This paper contains 19 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Our NLP Pipeline. The figure describes our NLP pipeline for a comprehensive analysis of heterogeneous data sources. Relevant versus not relevant data items concern their relevance to CTI. For topic modeling we use BERTopic (Grootendorst, 2022). We describe the details of our pipeline in the NLP pipeline section.
  • Figure 2: Systematic Literature Review. The figure describes the systematic literature review conducted in March 2024 to identify the state of the art. We exclude survey papers and literature reviews.
  • Figure 3: Summary of Literature Review: Data Sources. We observe a low number of "dark" data sources in the 27 analyzed works. While the number of underground forums seems comparatively high, most of these works examine a single forum rather than multiple ones. Also, we do not identify any paper reviewing chat channels such as Telegram or Discord in the context of CTI extraction. When data sources are used in combination, they are mostly from the clearnet.
  • Figure 4: Summary of Literature Review: NLP Methods. The main NLP method applied in prior work is Text Classification, followed by Topic Modeling. Please note that some papers use a combination of multiple NLP methods.
  • Figure 5: Logarithmic Word Count per Data Item by Data Source. We calculate the logarithm of the word count for each data item to define the cutoff for the maximum length. We set the cutoff to 1,000 words. We assume that long data entries discuss the primary topic within the initial 1,000 words of each post, message, or website.
  • ...and 4 more figures
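
The length handling described in the Figure 5 caption can be sketched as follows. This is a minimal illustration that assumes whitespace tokenization and a base-10 logarithm, neither of which the caption specifies. The names `MAX_WORDS`, `log_word_count`, and `truncate` are hypothetical.

```python
import math

MAX_WORDS = 1000  # cutoff stated in the Figure 5 caption


def log_word_count(text):
    """Return log10 of the whitespace-delimited word count, or None if empty.

    The logarithm base is an assumption; the caption only says "logarithm".
    """
    n = len(text.split())
    return math.log10(n) if n > 0 else None


def truncate(text, max_words=MAX_WORDS):
    """Keep only the first max_words words of a post, message, or website."""
    words = text.split()
    return " ".join(words[:max_words])
```

Truncating to the first 1,000 words follows the caption's assumption that long entries discuss their primary topic early in the text.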