Knowledge Transfer from LLMs to Provenance Analysis: A Semantic-Augmented Method for APT Detection

Fei Zuo, Junghwan Rhee, Yung Ryn Choe

TL;DR

This work tackles the challenge of detecting advanced persistent threats (APTs) by enriching provenance-based analysis with semantic augmentation from large language models (LLMs). It introduces a pipeline that pre-processes system provenance data, generates explanatory, semantically enriched event descriptions with GPT-4o, derives contextualized embeddings, reduces dimensionality with Kernel PCA, and applies supervised and semi-supervised detectors. Empirical results on the ProvSec dataset show near-perfect supervised precision (up to 99.1%) and strong semi-supervised precision (96.9%), with a case study demonstrating generalization to unseen attacks (ROC AUC 97.56%). The findings indicate that transferring LLM knowledge about system calls, software identity, and execution context into embeddings can substantially improve provenance-based APT detection, offering practical benefits for AI-assisted cybersecurity.
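
The two metrics quoted above, precision and ROC AUC, can be made concrete with a small sketch. The labels and scores below are made-up toy values, not results from the paper, and the helper names `precision` and `roc_auc` are illustrative assumptions rather than anything the authors define.

```python
from typing import Sequence


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of events flagged as malicious (1) that really are malicious."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random malicious event outscores a random benign one.

    Ties count as half a win (the Mann-Whitney U formulation of ROC AUC).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else (0.5 if p == n else 0.0)
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = adversary event, 0 = benign event.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]

# Turn anomaly scores into hard predictions with an arbitrary 0.5 threshold.
preds = [1 if s >= 0.5 else 0 for s in scores]

print(f"precision = {precision(labels, preds):.3f}")
print(f"ROC AUC   = {roc_auc(labels, scores):.3f}")
```

Precision depends on the chosen threshold, whereas ROC AUC summarizes ranking quality across all thresholds, which is one reason a detector's reported precision and ROC AUC are not directly comparable.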

Abstract

Advanced Persistent Threats (APTs) have caused significant losses across a wide range of sectors, including the theft of sensitive data and harm to system integrity. As attack techniques grow increasingly sophisticated and stealthy, the arms race between cyber defenders and attackers continues to intensify. The revolutionary impact of Large Language Models (LLMs) has opened up numerous opportunities in various fields, including cybersecurity. An intriguing question arises: can the extensive knowledge embedded in LLMs be harnessed for provenance analysis and play a positive role in identifying previously unknown malicious events? To seek a deeper understanding of this issue, we propose a new strategy for taking advantage of LLMs in provenance-based threat detection. In our design, a state-of-the-art LLM offers additional details in provenance data interpretation, leveraging its knowledge of system calls, software identity, and high-level understanding of application execution context. The LLM's contextualized embedding capability is further utilized to capture the rich semantics of event descriptions. We comprehensively examine the quality of the resulting embeddings and find them promising. Machine learning models built upon these embeddings demonstrate outstanding performance on real-world data. In our evaluation, supervised threat detection achieves a precision of 99.0%, and semi-supervised anomaly detection attains a precision of 96.9%.
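
As a rough illustration of how a raw provenance event might be handed to an LLM for interpretation, the sketch below formats one event into a text prompt. The `ProvenanceEvent` fields, the `build_prompt` helper, and the prompt wording are all hypothetical; the source does not give the authors' actual prompt or event schema, and no LLM API is called here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceEvent:
    """Hypothetical minimal event record; real provenance logs carry more fields."""

    process: str  # executable that issued the call
    syscall: str  # system call name
    target: str   # file, socket, or process acted upon


def build_prompt(event: ProvenanceEvent) -> str:
    """Render an event as a request for a semantically enriched description.

    The wording is an assumption chosen to touch the three knowledge sources
    the abstract names: system calls, software identity, execution context.
    """
    return (
        "Describe the following system event for a security analyst.\n"
        f"Process: {event.process}\n"
        f"System call: {event.syscall}\n"
        f"Target: {event.target}\n"
        "Explain what the system call does, what kind of software the "
        "process is, and what this action suggests about its execution context."
    )


# Made-up example event, not taken from any dataset.
event = ProvenanceEvent(process="/usr/bin/curl", syscall="connect", target="203.0.113.7:443")
prompt = build_prompt(event)
print(prompt)
```

The returned string would then be sent to the LLM, and the model's textual answer, not the raw event, is what gets embedded downstream.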

Paper Structure

This paper contains 26 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: A simplified provenance graph example.
  • Figure 2: System overview.
  • Figure 3: t-SNE visualization of event embeddings. Blue points represent benign events and red points represent adversary events.
  • Figure 4: The relative positions of eight system events in the embedding space are visualized using the MDS technique.
  • Figure 5: Comparison of different embedding methods for the final threat detection performance.
  • ...and 1 more figure