APT-LLM: Embedding-Based Anomaly Detection of Cyber Advanced Persistent Threats Using Large Language Models

Sidahmed Benabderrahmane, Petko Valtchev, James Cheney, Talal Rahwan

TL;DR

The paper tackles the challenge of detecting stealthy APTs in highly imbalanced provenance data. It introduces APT-LLM, which converts low-level process events into textual descriptions and uses pre-trained LLM embeddings (from models such as BERT, ALBERT, RoBERTa, DistilBERT, and MiniLM) combined with unsupervised autoencoders (AE, VAE, DAE) to model normal behavior and identify anomalies. Across DARPA Transparent Computing datasets spanning Android, Linux, BSD, and Windows, the ALBERT–VAE pairing achieves the top AUC (up to 0.95) and generally outperforms classical anomaly detectors, demonstrating the value of semantic embeddings in cybersecurity. The work shows that LLM-derived representations capture nuanced behavioral signatures that improve anomaly detection under extreme class imbalance, offering a scalable approach for real-world APT defense.

Abstract

Advanced Persistent Threats (APTs) pose a major cybersecurity challenge due to their stealth and ability to mimic normal system behavior, making detection particularly difficult in highly imbalanced datasets. Traditional anomaly detection methods struggle to effectively differentiate APT-related activities from benign processes, limiting their applicability in real-world scenarios. This paper introduces APT-LLM, a novel embedding-based anomaly detection framework that integrates large language models (LLMs) -- BERT, ALBERT, DistilBERT, and RoBERTa -- with autoencoder architectures to detect APTs. Unlike prior approaches, which rely on manually engineered features or conventional anomaly detection models, APT-LLM leverages LLMs to encode process-action provenance traces into semantically rich embeddings, capturing nuanced behavioral patterns. These embeddings are analyzed using three autoencoder architectures -- Baseline Autoencoder (AE), Variational Autoencoder (VAE), and Denoising Autoencoder (DAE) -- to model normal process behavior and identify anomalies. The best-performing model is selected for comparison against traditional methods. The framework is evaluated on real-world, highly imbalanced provenance trace datasets from the DARPA Transparent Computing program, where APT-like attacks constitute as little as 0.004% of the data across multiple operating systems (Android, Linux, BSD, and Windows) and attack scenarios. Results demonstrate that APT-LLM significantly improves detection performance under extreme imbalance conditions, outperforming existing anomaly detection methods and highlighting the effectiveness of LLM-based feature extraction in cybersecurity.
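The pipeline the abstract describes (process events turned into text, text turned into embeddings, an autoencoder fitted on normal behavior, anomalies scored by reconstruction error) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration and not the authors' implementation: the sentence template in `describe`, the process and action names, and all function names are hypothetical; a bag-of-words count vector stands in for a pre-trained LLM embedding; and a linear autoencoder fitted in closed form via SVD (equivalent to PCA) stands in for the neural AE, VAE, and DAE models.

```python
import numpy as np

def describe(process, actions):
    # Hypothetical sentence template; the paper's exact wording is not given here.
    return f"process {process} performed " + " ".join(actions)

def build_vocab(texts):
    # Map every distinct token to a column index.
    tokens = sorted({tok for text in texts for tok in text.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def embed(texts, vocab):
    # Stand-in for an LLM embedding (assumption): plain bag-of-words counts.
    out = np.zeros((len(texts), len(vocab)))
    for row, text in enumerate(texts):
        for tok in text.split():
            out[row, vocab[tok]] += 1.0
    return out

def fit_linear_autoencoder(x_normal, k):
    # Linear autoencoder with a k-dimensional bottleneck, fitted on normal
    # data only. The top-k right singular vectors serve as tied
    # encoder/decoder weights.
    mean = x_normal.mean(axis=0)
    _, _, vt = np.linalg.svd(x_normal - mean, full_matrices=False)
    return mean, vt[:k]

def reconstruction_error(x, mean, components):
    # Per-sample mean squared error between input and its reconstruction.
    centered = x - mean
    recon = centered @ components.T @ components
    return ((centered - recon) ** 2).mean(axis=1)

# Toy traces, invented for illustration.
normal = [
    describe("sshd", ["open_file", "read_file", "close_file"]),
    describe("cron", ["open_file", "write_file", "close_file"]),
    describe("nginx", ["accept_connection", "send_data", "close_file"]),
] * 10
suspect = describe(
    "sshd", ["open_file", "change_permissions", "execute_file", "connect_socket"]
)

vocab = build_vocab(normal + [suspect])
mean, components = fit_linear_autoencoder(embed(normal, vocab), k=2)
errors = reconstruction_error(embed(normal + [suspect], vocab), mean, components)

print("max error on normal traces:", errors[:-1].max())
print("error on suspect trace:    ", errors[-1])
```

The suspect trace uses actions never seen in the normal data, so the model fitted on normal behavior cannot reconstruct it and its error stands out. A real pipeline would replace `embed` with sentence embeddings from a pre-trained language model and the SVD fit with a trained neural autoencoder.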

Paper Structure

This paper contains 21 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: t-SNE Visualizations of Embeddings Using Different LLMs. Blue points (label 0) represent normal data, whereas orange points (label 1) represent anomalies. In this example, the data come from the PE dataset for the Linux OS, Bovia scenario.
  • Figure 2: Training and Validation Loss Curve for the Autoencoder: The figure shows the decrease in training and validation loss (MSE) over 100 epochs, demonstrating the model's convergence and generalization during training.
  • Figure 3: Scatter Plot of the Autoencoder Reconstruction Errors by Sample Index: The figure illustrates the reconstruction errors (MSE) for each sample, with the red dashed line indicating the anomaly detection threshold. Points above the threshold represent detected anomalies.
  • Figure 4: Heatmap of AUC Scores for LLM and Autoencoder Combinations: The figure shows the performance of five large language models (LLMs) paired with three autoencoder architectures (AE, DAE, VAE) on anomaly detection tasks, with darker shades indicating higher AUC scores. The data come from the PE dataset for the Linux OS, Bovia scenario.
  • Figure 5: ROC Curve Comparison for the Best-Performing Models (PE dataset for the Linux OS, Bovia scenario).
  • ...and 1 more figures
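
The threshold line in Figure 3 and the AUC scores reported in Figures 4 and 5 can be illustrated with a short sketch. It assumes the threshold is set at the 95th percentile of reconstruction errors on data taken to be normal; the captions do not say how the paper chooses its threshold, so this rule is an assumption. The function names are hypothetical, and the error values are toy numbers invented for illustration. Labels follow Figure 1: 0 for normal, 1 for anomaly.

```python
import numpy as np

def percentile_threshold(normal_errors, q=95.0):
    # Assumed rule: threshold at the q-th percentile of errors on normal data.
    return float(np.percentile(normal_errors, q))

def roc_auc(scores, labels):
    # AUC as the probability that a randomly chosen anomaly scores higher
    # than a randomly chosen normal sample; ties count as one half.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Toy reconstruction errors (MSE), invented for illustration.
errors = np.array([0.10, 0.12, 0.11, 0.13, 0.80, 0.09, 0.75])
labels = np.array([0, 0, 0, 0, 1, 0, 1])

threshold = percentile_threshold(errors[labels == 0])
flagged = errors > threshold          # points above the dashed line
auc = roc_auc(errors, labels)

print(f"threshold={threshold:.3f}")
print("flagged sample indices:", np.flatnonzero(flagged).tolist())
print(f"AUC={auc:.2f}")
```

The two measures answer different questions. The threshold fixes a single operating point, and a percentile rule will by construction flag a small share of normal samples as well. AUC is threshold-free: it measures how well the errors rank anomalies above normal samples overall, which is why it suits comparisons across model pairings under heavy class imbalance.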