Inferring Discussion Topics about Exploitation of Vulnerabilities from Underground Hacking Forums
Felipe Moreno-Vera
TL;DR
This work identifies vulnerability-related discussions in underground hacking forums by applying Latent Dirichlet Allocation (LDA) to a large, CVE-focused dataset. It combines CrimeBB forum data with CVE information from the National Vulnerability Database (NVD), using CVE-based filtering and a four-label scheme (PoC, Weaponization, Exploitation, Other) to uncover latent themes and their evolution. The study shows that LDA can reveal meaningful topic clusters and overlaps (notably between Weaponization and Exploitation) and highlights the practical value of such analysis for early threat intelligence and vulnerability management. Overall, the approach provides a scalable framework for monitoring real-world exploit discourse and prioritizing defenses against emerging vulnerabilities and techniques.
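As a rough illustration of what CVE-based filtering could look like, the sketch below keeps only posts that mention at least one CVE identifier. It is not the author's pipeline: the post layout (dictionaries with `id` and `text` keys), the function names, and the toy posts are assumptions made for this example, and the actual CrimeBB schema may differ. The pattern relies only on the public CVE ID syntax (`CVE-YYYY-NNNN`, with four or more digits in the sequence number).

```python
import re

# CVE identifiers: "CVE-", a four-digit year, then a sequence number of 4+ digits.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cves(text):
    """Return the sorted, de-duplicated, upper-cased CVE IDs found in text."""
    return sorted({match.upper() for match in CVE_PATTERN.findall(text)})


def filter_cve_posts(posts):
    """Keep posts that mention at least one CVE ID.

    `posts` is a list of dicts with hypothetical keys "id" and "text".
    Each kept post is copied and given an extra "cves" key.
    """
    kept = []
    for post in posts:
        cves = extract_cves(post["text"])
        if cves:
            kept.append({**post, "cves": cves})
    return kept


# Toy posts, invented for illustration only.
sample_posts = [
    {"id": 1, "text": "Working PoC for cve-2021-44228, tested on a lab server"},
    {"id": 2, "text": "Selling accounts, PM me"},
    {"id": 3, "text": "Chaining CVE-2017-0144 with CVE-2017-0145"},
]

for post in filter_cve_posts(sample_posts):
    print(post["id"], post["cves"])
```

Matching is case-insensitive because forum users often write identifiers in lower case; normalizing to upper case makes it easier to join the results against NVD records.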
Abstract
The increasing sophistication of cyber threats calls for proactive measures to identify vulnerabilities and potential exploits. Underground hacking forums serve as breeding grounds for the exchange of hacking techniques and discussions of exploitation. In this research, we propose a topic-modeling approach to analyze and uncover key themes in the vulnerabilities discussed within these forums. Our objective is to develop a machine-learning model that automatically detects and classifies vulnerability-related discussions in underground hacking forums. By monitoring and analyzing forum content, we aim to identify emerging vulnerabilities, exploit techniques, and potential threat actors. To achieve this, we collect a large-scale dataset of posts and threads from multiple underground forums, then preprocess and clean the data to ensure accuracy and reliability. Using Latent Dirichlet Allocation (LDA), we uncover latent topics and their associated keywords within the dataset. This enables us to identify recurring themes and prevalent discussions related to vulnerabilities, exploits, and potential targets.
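To make the LDA step concrete, the following is a minimal sketch of topic inference with a collapsed Gibbs sampler, written in plain Python so that it runs without third-party libraries. It is an illustration under stated assumptions, not the implementation used in this work: the function name `lda_gibbs`, the hyperparameters (`alpha`, `beta`, `n_iter`), the choice of four topics, and the toy tokenized posts are all hypothetical.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics=4, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Fit LDA to tokenized docs with collapsed Gibbs sampling.

    Returns (theta, top_words): per-document topic proportions and the
    three most frequent words assigned to each topic.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # topic counts per document
    topic_word = [Counter() for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                       # token counts per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    top_words = [
        [w for w, c in topic_word[k].most_common(3) if c > 0]
        for k in range(n_topics)
    ]
    return theta, top_words


# Toy tokenized posts, invented for illustration only.
toy_docs = [
    ["poc", "exploit", "code", "github", "poc"],
    ["payload", "crypter", "builder", "payload"],
    ["target", "server", "shell", "exploit", "target"],
    ["selling", "accounts", "cheap", "selling"],
]

theta, top_words = lda_gibbs(toy_docs)
for k, words in enumerate(top_words):
    print(f"topic {k}: {words}")
```

On a real corpus, a library implementation such as those in gensim or scikit-learn would normally be preferable to a hand-written sampler; the sketch only shows which quantities LDA estimates, namely the document-topic proportions (`theta`) and the keywords associated with each topic.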
