Ranking-Enhanced Anomaly Detection Using Active Learning-Assisted Attention Adversarial Dual AutoEncoders

Sidahmed Benabderrahmane, James Cheney, Talal Rahwan

TL;DR

The paper tackles anomaly detection for Advanced Persistent Threats in environments with scarce labeled data. It introduces ALADAEN, a framework that combines Attention Adversarial Dual AutoEncoders with an active learning loop and GAN-based data augmentation to produce a ranking-oriented anomaly score for SOC triage. Through extensive evaluation on 40 DARPA TC provenance datasets spanning Android, Linux, BSD, and Windows, ALADAEN achieves superior ranking performance (measured by nDCG) and practical runtimes compared to state-of-the-art baselines. This approach enables effective detection of rare APT patterns while minimizing labeling effort, offering a scalable solution for real-world cybersecurity operations across diverse OS environments.

Abstract

Advanced Persistent Threats (APTs) pose a significant challenge in cybersecurity due to their stealthy and long-term nature. Modern supervised learning methods require extensive labeled data, which is often scarce in real-world cybersecurity environments. In this paper, we propose an approach that leverages AutoEncoders for unsupervised anomaly detection, augmented by active learning to iteratively improve the detection of APT anomalies. By selectively querying an oracle for labels on uncertain or ambiguous samples, we improve detection accuracy while keeping manual labeling costs low. We provide a detailed formulation of the proposed Attention Adversarial Dual AutoEncoder-based anomaly detection framework and show how the active learning loop iteratively enhances the model. The framework is evaluated on real-world imbalanced provenance trace databases produced by the DARPA Transparent Computing program, where APT-like attacks constitute as little as 0.004% of the data. The datasets span multiple operating systems, including Android, Linux, BSD, and Windows, and cover two attack scenarios. The results show significant improvements in detection rates during active learning and better performance than existing approaches.
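The query step described in the abstract can be illustrated with a small sketch. It assumes that "uncertain" means an anomaly score close to a decision threshold; the function name `select_uncertain`, the scores, the threshold, and the budget are all hypothetical and are not taken from the paper.

```python
def select_uncertain(scores, threshold, budget):
    """Return indices of the `budget` samples whose score is closest to `threshold`.

    Assumption: uncertainty is measured as |score - threshold|; the paper may
    use a different criterion.
    """
    # Order sample indices by distance to the threshold (index breaks ties).
    order = sorted(range(len(scores)),
                   key=lambda i: (abs(scores[i] - threshold), i))
    return order[:budget]


# Hypothetical anomaly scores for six unlabeled processes.
scores = [0.05, 0.48, 0.91, 0.52, 0.10, 0.70]
queried = select_uncertain(scores, threshold=0.5, budget=2)
print(sorted(queried))  # indices that would be sent to the oracle for labeling
```

Samples far from the threshold, whether clearly benign or clearly anomalous, are left unqueried, which is what keeps the labeling cost low in this kind of loop.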

Paper Structure

This paper contains 54 sections, 21 equations, 21 figures, 5 tables, 1 algorithm.

Figures (21)

  • Figure 1: Overall architecture of the ALADAEN framework: Active Learning-Assisted Attention Adversarial Dual AutoEncoders for anomaly detection. The framework is composed of three interconnected modules: (1) Data Preparation, which formats provenance events into process-level feature vectors; (2) ADAEN Backbone, a dual autoencoder with attention and adversarial training that learns a robust representation of benign behavior and computes reconstruction-based anomaly scores; and (3) Active Learning & GAN Augmentation, which iteratively selects the most informative unlabeled samples (based on uncertainty), queries them from an oracle, and uses GAN-generated synthetic samples to enrich the benign pool before retraining. This modular design enables continuous refinement and improved anomaly ranking under scarce-label conditions.
  • Figure 2: Step-by-step workflow of ALADAEN. Steps 1-2: Construct process-level feature vectors from the provenance graph and initialize the training set with a small subset of labeled benign data (cold start). Step 3: Train the ADAEN model to reconstruct benign samples and compute anomaly scores based on reconstruction error. Step 4: Rank all unlabeled samples by anomaly score to prioritize potential threats. Step 5: Apply active learning to select the most uncertain samples (i.e., those near the decision threshold) for oracle labeling. Step 6: Augment the newly labeled benign data using a GAN to mitigate data scarcity and retrain the ADAEN model with the enriched dataset. Step 7: Repeat previous steps iteratively until the labeling budget is exhausted. This iterative cycle progressively improves anomaly ranking and detection robustness.
  • Figure 3: nDCG Score Comparison of Anomaly Detection Algorithms Across Operating Systems and Attack Scenarios. The rows represent the anomaly detection methods, while the columns represent the datasets. The subfigures on the left-hand side correspond to the first attack scenario, and the subfigures on the right-hand side correspond to the second attack scenario.
  • Figure 4: Reconstruction Error Histogram with Threshold, where the x-axis represents the reconstruction error values, and the y-axis indicates the frequency (count) of data points with a certain error.
  • Figure 5: nDCG score variation over Active Learning iterations for the Linux PA dataset using the ALADAEN framework (Pandex E1 scenario). The x-axis shows active learning iterations and the y-axis shows nDCG scores. The raw nDCG data is represented by the blue dashed line, while the smoothed nDCG values are depicted by the red dashed line. On the right, a boxplot of the nDCG values shows the distribution of scores throughout the Active Learning process. The figure highlights the model's performance improvements over iterations, stabilizing at higher nDCG scores as more data is incorporated.
  • ...and 16 more figures
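
Several of the captions above report nDCG. As a rough illustration of what that metric rewards, the sketch below computes the standard nDCG with binary relevance (1 for an attack process, 0 for benign) over a ranking by anomaly score. The function names and sample data are hypothetical, and the paper's exact nDCG variant is not shown in this excerpt.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(scores, labels):
    """nDCG of the ranking obtained by sorting samples by descending anomaly score.

    Assumption: labels are binary relevance values (1 = attack, 0 = benign).
    """
    ranked = [lab for _, lab in sorted(zip(scores, labels), key=lambda p: -p[0])]
    ideal_dcg = dcg(sorted(labels, reverse=True))
    # With no attacks present the metric is undefined; return 0.0 by convention here.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# One attack among three processes: ranked first versus ranked last.
print(ndcg([0.9, 0.1, 0.2], [1, 0, 0]))
print(ndcg([0.1, 0.2, 0.9], [1, 0, 0]))
```

Because the discount grows with rank, a ranking that places the few true attacks near the top scores close to 1, which matches the triage setting where an analyst inspects only the head of the list.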