Inferring Discussion Topics about Exploitation of Vulnerabilities from Underground Hacking Forums

Felipe Moreno-Vera

TL;DR

This work identifies vulnerability-related discussions in underground hacking forums by applying Latent Dirichlet Allocation (LDA) to a large, CVE-focused dataset. It combines CrimeBB forum data with NVD CVE information, using CVE-based filtering and a four-topic labeling scheme (PoC, Weaponization, Exploitation, Other) to uncover latent themes and their evolution. The study demonstrates that LDA can reveal meaningful topic clusters and overlaps (notably between Weaponization and Exploitation) and highlights the practical value of such analysis for early threat intelligence and vulnerability management. Overall, the approach provides a scalable framework for monitoring real-world exploit discourse and prioritizing defensive measures against emerging vulnerabilities and techniques.

Abstract

The increasing sophistication of cyber threats necessitates proactive measures to identify vulnerabilities and potential exploits. Underground hacking forums serve as breeding grounds for the exchange of hacking techniques and discussions related to exploitation. In this research, we propose an innovative approach using topic modeling to analyze and uncover key themes in vulnerabilities discussed within these forums. The objective of our study is to develop a machine learning-based model that can automatically detect and classify vulnerability-related discussions in underground hacking forums. By monitoring and analyzing the content of these forums, we aim to identify emerging vulnerabilities, exploit techniques, and potential threat actors. To achieve this, we collect a large-scale dataset consisting of posts and threads from multiple underground forums. We preprocess and clean the data to ensure accuracy and reliability. Leveraging topic modeling techniques, specifically Latent Dirichlet Allocation (LDA), we uncover latent topics and their associated keywords within the dataset. This enables us to identify recurring themes and prevalent discussions related to vulnerabilities, exploits, and potential targets.
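
To make the topic-modeling step concrete, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling in pure Python. It is an illustration only, not the authors' implementation: the toy documents, the function name fit_lda, and the hyperparameters (alpha, beta, iterations, seed) are assumptions, and the choice of four topics simply mirrors the four labels named in the TL;DR.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Hypothetical helper for illustration; hyperparameters are assumed values.
    Returns per-document topic mixtures and the top words of each topic.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                        # all tokens in topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        topics = []
        for word in doc:
            k = rng.randrange(num_topics)
            topics.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(topics)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1

    # Smoothed topic mixture per document; each row sums to 1.
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    top_words = [
        [word for word, count in topic_word[t].most_common(5) if count > 0]
        for t in range(num_topics)
    ]
    return theta, top_words


# Toy corpus (invented for illustration, not taken from CrimeBB).
DOCS = [
    "poc proof concept code crash".split(),
    "exploit kit builder payload crypter".split(),
    "target server exploited shell access".split(),
    "poc code crash overflow".split(),
    "payload crypter builder sell".split(),
    "shell access server root".split(),
]

theta, top_words = fit_lda(DOCS, num_topics=4)
for t, words in enumerate(top_words):
    print(f"topic {t}: {words}")
```

On a corpus this small the recovered topics are not meaningful; the sketch only shows how per-document topic mixtures and per-topic keywords arise from the token-level assignments. In practice a library implementation would normally be used instead.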

Paper Structure (25 sections, 3 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: CrimeBB dataset, showing the hierarchical composition of websites, boards, threads, and posts.
  • Figure 2: Text preprocessing pipeline, showing all steps from raw input to vectorization. Note that not all words are kept; those that are kept are lemmatized to their base form. The example post was taken from the HackForums website.
  • Figure 3: Post concatenation by thread. If at least one post cites a CVE code, we take all other posts from the same thread together with it as one text sample. Otherwise, the complete thread is excluded from the dataset. This is why we do not use all labeled threads.
  • Figure 4: Common Vulnerability Scoring System (CVSS) severity levels, comparing versions 2 and 3.1 of the CVSS scores. We note that about 908 969 CVE codes are not found in CVSS version 3.1.
  • Figure 5: Topic group projection by principal components. (a) The radius of each group determines the marginal topic distribution, (b) the top 30 most salient terms, (c) the top 30 most relevant terms for topic PoC, (d) the top 30 most relevant terms for topic Weaponization, (e) the top 30 most relevant terms for topic Exploitation, and (f) the top 30 most relevant terms for topic Others.
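
The thread-level filtering described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the (thread_id, text) input layout, the function name build_thread_samples, and the sample posts are hypothetical, and the regular expression follows the public CVE identifier format (CVE, a four-digit year, then four or more digits) rather than any pattern confirmed by the paper.

```python
import re
from collections import defaultdict

# CVE identifiers look like CVE-2017-0144: a four-digit year, then 4+ digits.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def build_thread_samples(posts):
    """Group posts by thread and keep only threads that cite a CVE.

    `posts` is an iterable of (thread_id, text) pairs in posting order
    (an assumed layout). Returns {thread_id: concatenated thread text}.
    """
    threads = defaultdict(list)
    for thread_id, text in posts:
        threads[thread_id].append(text)

    samples = {}
    for thread_id, texts in threads.items():
        # One CVE mention anywhere in the thread keeps the whole thread.
        if any(CVE_PATTERN.search(text) for text in texts):
            samples[thread_id] = " ".join(texts)
    return samples


# Invented example posts, not taken from CrimeBB.
POSTS = [
    (1, "Anyone tested the SMB bug?"),
    (1, "Yes, CVE-2017-0144 works on unpatched hosts."),
    (2, "Selling accounts, PM me."),
    (2, "Still available?"),
]

samples = build_thread_samples(POSTS)
print(samples)
```

Thread 1 is kept as a single text sample because one of its posts cites a CVE code; thread 2 is dropped entirely.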