SAGA: Synthetic Audit Log Generation for APT Campaigns
Yi-Ting Huang, Ying-Ren Guo, Yu-Sheng Yang, Guo-Wei Wong, Yu-Zih Jheng, Yeali Sun, Jessemyn Modini, Timothy Lynar, Meng Chang Chen
TL;DR
SAGA introduces a fully automated framework to generate configurable, finely labeled synthetic audit logs that emulate real system logs and embed APT campaigns aligned to MITRE ATT&CK. By extracting attack patterns from red-team emulations, abstracting them into templates, and instantiating diverse artifacts, SAGA produces logs suitable for training deep learning models and benchmarking multiple detection methods. The authors demonstrate usefulness through intrusion detection, technique hunting, and campaign attribution experiments, showing that models trained on synthetic logs can generalize to unseen techniques and campaigns. This synthetic-data approach provides scalable, scenario-driven benchmarks to advance ML-based defense while acknowledging limitations related to realism and distributional shifts. The work highlights practical impact for evaluating and developing APT detection methods in a controlled, reproducible setting and outlines avenues for further enhancement and cross-platform applicability.
Abstract
With the increasing sophistication of Advanced Persistent Threats (APTs), the demand for effective detection and mitigation methods has escalated. Program execution leaves traces in the system audit log, which can be analyzed to detect malicious activities. However, collecting and analyzing large volumes of audit logs over extended periods is challenging, and insufficient labeling further limits their usability. Addressing these challenges, this paper introduces SAGA (Synthetic Audit log Generation for APT campaigns), a novel approach for generating fine-grained labeled synthetic audit logs that mimic real-world system logs while embedding stealthy APT attacks. SAGA generates configurable audit logs of arbitrary duration, blending benign logs from normal operations with malicious logs based on the definitions in the MITRE ATT&CK framework. Malicious audit logs follow an APT lifecycle, incorporating various attack techniques at each stage. These synthetic logs can serve as benchmark datasets for training machine learning models and assessing diverse APT detection methods. To demonstrate the usefulness of synthetic audit logs, we ran established baselines of event-based technique hunting and APT campaign detection using various synthetic audit logs. In addition, we show that a deep learning model trained on synthetic audit logs can detect previously unseen techniques within audit logs.
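The blending idea described in the abstract can be illustrated with a minimal sketch. This is not SAGA's implementation: the event fields, the benign process names, the four-stage lifecycle, and the function `generate_log` are all hypothetical choices made for illustration. The technique IDs are real MITRE ATT&CK identifiers, but their selection here is arbitrary. The sketch draws benign events at random times over a configurable duration, places one labeled malicious event per lifecycle stage in chronological order, and merges the two streams.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical, simplified APT lifecycle: (stage name, candidate ATT&CK technique IDs).
LIFECYCLE = [
    ("initial_access", ["T1566"]),  # Phishing
    ("execution", ["T1059"]),       # Command and Scripting Interpreter
    ("persistence", ["T1547"]),     # Boot or Logon Autostart Execution
    ("exfiltration", ["T1041"]),    # Exfiltration Over C2 Channel
]

# Hypothetical benign vocabulary, assumed for illustration only.
BENIGN_PROCESSES = ["chrome.exe", "explorer.exe", "svchost.exe"]
BENIGN_ACTIONS = ["file_read", "file_write", "net_connect"]


@dataclass
class Event:
    timestamp: float
    process: str
    action: str
    stage: Optional[str] = None      # None marks a benign event
    technique: Optional[str] = None  # ATT&CK technique label, if malicious


def generate_log(duration_s: float, benign_rate: float, seed: int = 0) -> List[Event]:
    """Blend benign events with one labeled malicious event per lifecycle stage."""
    rng = random.Random(seed)
    events: List[Event] = []

    # Benign background: exponential inter-arrival times at `benign_rate` events/second.
    t = rng.expovariate(benign_rate)
    while t < duration_s:
        events.append(Event(t, rng.choice(BENIGN_PROCESSES), rng.choice(BENIGN_ACTIONS)))
        t += rng.expovariate(benign_rate)

    # Malicious events: sorted random times so the stages occur in lifecycle order.
    times = sorted(rng.uniform(0, duration_s) for _ in LIFECYCLE)
    for ts, (stage, techniques) in zip(times, LIFECYCLE):
        events.append(Event(ts, "powershell.exe", "proc_create", stage, rng.choice(techniques)))

    # Merge both streams into a single chronological log (stable sort).
    events.sort(key=lambda e: e.timestamp)
    return events


if __name__ == "__main__":
    log = generate_log(duration_s=3600, benign_rate=0.5, seed=1)
    for event in log:
        if event.technique is not None:
            print(f"{event.timestamp:8.1f}s  {event.stage:<15} {event.technique}")
```

Because every event carries its own stage and technique label (or `None` for benign activity), a log produced this way is labeled at the event level, which is the property the abstract refers to as fine-grained labeling.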
