
Uncovering Semantics and Topics Utilized by Threat Actors to Deliver Malicious Attachments and URLs

Andrey Yakymovych, Abhishek Singh

TL;DR

This work investigates how semantics in multilingual emails influence the delivery of malicious attachments and call-to-action URLs. It combines BERTopic with BGE-M3 embeddings, UMAP, and density-based clustering (HDBSCAN/OPTICS), augmented by Phi-3-Mini-4K-Instruct for semantic interpretation and hLDA for thematic analysis, to reveal recurrent threat actor patterns. Key findings show that OPTICS-xi produces more topics with high interpretability (295 clusters across 14 categories) and that hierarchical topic modeling with hLDA yields meaningful thematic structure, suggesting that semantic-aware signals can enhance threat detection. The methods show practical potential for adding semantic and contextual threat cues to detection systems across multilingual email corpora.

Abstract

Recent threat reports highlight that email remains the top vector for delivering malware to endpoints. Despite this, detection of malicious email attachments and URLs often neglects semantic cues such as linguistic features and contextual clues. Our study employs BERTopic, an unsupervised topic modeling framework, to identify common semantics and themes embedded in emails that deliver malicious attachments and call-to-action URLs. We preprocess emails by extracting and sanitizing content, then employ multilingual embedding models such as BGE-M3 to produce dense representations, which clustering algorithms (HDBSCAN and OPTICS) use to group emails by semantic similarity. Phi-3-Mini-4K-Instruct facilitates semantic analysis, and hLDA aids thematic analysis, to understand threat actor patterns. We evaluate and compare different clustering algorithms on topic quantity, coherence, and diversity metrics, and conclude with insights into the semantics and topics commonly used by threat actors to deliver malicious attachments and URLs, contributing to the field of threat detection.
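
To make the diversity metric named in the abstract concrete, here is a minimal sketch. It assumes the widely used definition of topic diversity (the fraction of unique words among the top-k keywords of all topics); the abstract does not state which definition the authors used. The function name `topic_diversity` and the example keyword lists are hypothetical and invented for illustration.

```python
from typing import Sequence


def topic_diversity(topics: Sequence[Sequence[str]], top_k: int = 10) -> float:
    """Return the fraction of unique words among the top_k words of all topics.

    Assumption: this is the common "unique top words / total top words"
    definition, not necessarily the exact metric used in the paper.
    """
    # Flatten the top_k keywords of every topic into one list.
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("at least one non-empty topic is required")
    return len(set(top_words)) / len(top_words)


# Hypothetical topic keyword lists, invented for illustration only.
example_topics = [
    ["invoice", "payment", "attached", "overdue"],
    ["shipment", "delivery", "tracking", "attached"],
    ["password", "account", "verify", "login"],
]

score = topic_diversity(example_topics, top_k=4)
print(f"topic diversity: {score:.3f}")  # prints "topic diversity: 0.917"
```

A score near 1 means the topics share few keywords, while a score near 0 means the clustering produced largely redundant topics. In the example, only "attached" repeats, so 11 of the 12 keywords are unique.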

Paper Structure

This paper contains 12 sections, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Email delivering malicious SVG.
  • Figure 2: Diagram of the BERTopic pipeline showing the submodel configuration.
  • Figure 3: Visualizations of BGE-M3 embeddings at multiple zoom levels. These show the variable density and complex shapes of potential clusters.
  • Figure 4: Evaluation results of clustering configurations, grouped by minimum cluster size. Scores were averaged over 3 runs for each clustering configuration. One of the OPTICS-xi runs (minimum cluster size: 50) produced an error in the coherence metric calculation, but this did not affect the conclusion.
  • Figure 5: Workflow for semantic and thematic analysis.
  • ...and 2 more figures