Table of Contents
Fetching ...

EUREKHA: Enhancing User Representation for Key Hackers Identification in Underground Forums

Abdoul Nasser Hassane Amadou, Anas Motii, Saida Elouardi, EL Houcine Bergou

TL;DR

A novel method called EUREKHA (Enhancing User Representation for Key Hacker Identification in Underground Forums), designed to identify these key hackers by modeling each user as a textual sequence, which demonstrates that fine-tuned LLMs outperform state-of-the-art methods in identifying key hackers.

Abstract

Underground forums serve as hubs for cybercriminal activities, offering a space for anonymity and evasion of conventional online oversight. In these hidden communities, malicious actors collaborate to exchange illicit knowledge, tools, and tactics, driving a range of cyber threats from hacking techniques to the sale of stolen data, malware, and zero-day exploits. Identifying the key instigators (i.e., key hackers), behind these operations is essential but remains a complex challenge. This paper presents a novel method called EUREKHA (Enhancing User Representation for Key Hacker Identification in Underground Forums), designed to identify these key hackers by modeling each user as a textual sequence. This sequence is processed through a large language model (LLM) for domain-specific adaptation, with LLMs acting as feature extractors. These extracted features are then fed into a Graph Neural Network (GNN) to model user structural relationships, significantly improving identification accuracy. Furthermore, we employ BERTopic (Bidirectional Encoder Representations from Transformers Topic Modeling) to extract personalized topics from user-generated content, enabling multiple textual representations per user and optimizing the selection of the most representative sequence. Our study demonstrates that fine-tuned LLMs outperform state-of-the-art methods in identifying key hackers. Additionally, when combined with GNNs, our model achieves significant improvements, resulting in approximately 6% and 10% increases in accuracy and F1-score, respectively, over existing methods. EUREKHA was tested on the Hack-Forums dataset, and we provide open-source access to our code.

EUREKHA: Enhancing User Representation for Key Hackers Identification in Underground Forums

TL;DR

A novel method called EUREKHA (Enhancing User Representation for Key Hacker Identification in Underground Forums), designed to identify these key hackers by modeling each user as a textual sequence, which demonstrates that fine-tuned LLMs outperform state-of-the-art methods in identifying key hackers.

Abstract

Underground forums serve as hubs for cybercriminal activities, offering a space for anonymity and evasion of conventional online oversight. In these hidden communities, malicious actors collaborate to exchange illicit knowledge, tools, and tactics, driving a range of cyber threats from hacking techniques to the sale of stolen data, malware, and zero-day exploits. Identifying the key instigators (i.e., key hackers), behind these operations is essential but remains a complex challenge. This paper presents a novel method called EUREKHA (Enhancing User Representation for Key Hacker Identification in Underground Forums), designed to identify these key hackers by modeling each user as a textual sequence. This sequence is processed through a large language model (LLM) for domain-specific adaptation, with LLMs acting as feature extractors. These extracted features are then fed into a Graph Neural Network (GNN) to model user structural relationships, significantly improving identification accuracy. Furthermore, we employ BERTopic (Bidirectional Encoder Representations from Transformers Topic Modeling) to extract personalized topics from user-generated content, enabling multiple textual representations per user and optimizing the selection of the most representative sequence. Our study demonstrates that fine-tuned LLMs outperform state-of-the-art methods in identifying key hackers. Additionally, when combined with GNNs, our model achieves significant improvements, resulting in approximately 6% and 10% increases in accuracy and F1-score, respectively, over existing methods. EUREKHA was tested on the Hack-Forums dataset, and we provide open-source access to our code.

Paper Structure

This paper contains 31 sections, 4 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Overview of our proposed framework EUREKHA.
  • Figure 2: The workflow for modeling topics using BERTopic.
  • Figure 3: User sequence representation example.
  • Figure 4: Network schema of Hack-Forums.
  • Figure 5: Best hyperparameter configurations for ALBERT, BERT, RoBERTa, and XLNet on the R3 user sequence representation.
  • ...and 3 more figures