AnomalyGen: An Automated Semantic Log Sequence Generation Framework with LLM for Anomaly Detection
Xinyu Li, Yingtong Huo, Chenxi Mao, Shiwen Shan, Yuxin Su, Dan Li, Zibin Zheng
TL;DR
AnomalyGen addresses the bottleneck of scarce high-quality log datasets by introducing a four-phase framework that combines enhanced static program analysis with large-language-model chain-of-thought reasoning to synthesize semantically rich anomaly logs without executing systems. The methodology prunes log-relevant call graphs, mines fine-grained subgraphs, performs CoT-verified log merging, and applies knowledge-driven labeling to produce labeled datasets with execution contexts and anomalies. Empirical results on Hadoop and HDFS show substantial gains in event coverage ($97.48\%$ on average; $38$–$95\times$ more events than baselines) and modest but meaningful improvements in anomaly detection performance (average $1.8\%$ F1, up to $3.7\%$ for certain models). This work not only provides a high-quality benchmarking resource but also demonstrates a new paradigm for applying LLMs in software engineering workflows, enabling more realistic, semantically aware synthetic logs for training and evaluation.
Abstract
The scarcity of high-quality public log datasets has become a critical bottleneck in advancing log-based anomaly detection techniques. Current datasets exhibit three fundamental limitations: (1) incomplete event coverage, (2) artificial patterns introduced by static analysis-based generation frameworks, and (3) insufficient semantic awareness. To address these challenges, we present AnomalyGen, the first automated log synthesis framework specifically designed for anomaly detection. Our framework introduces a novel four-phase architecture that integrates enhanced program analysis with Chain-of-Thought reasoning (CoT reasoning), enabling iterative log generation and anomaly annotation without requiring physical system execution. Evaluations on Hadoop and HDFS distributed systems demonstrate that AnomalyGen achieves substantially broader log event coverage (38-95 times improvement over existing datasets) while producing more operationally realistic log sequences compared to static analysis-based approaches. When augmenting benchmark datasets with synthesized logs, we observe maximum F1-score improvements of 3.7% (average 1.8% improvement across three state-of-the-art anomaly detection models). This work not only establishes a high-quality benchmarking resource for automated log analysis but also pioneers a new paradigm for applying large language models (LLMs) in software engineering workflows.
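
To make the call-graph pruning idea from the summary more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that "log-relevant" means a method either emits a log statement or can transitively reach one that does. The function name `prune_to_log_relevant`, the dictionary-based graph representation, and the toy method names are all hypothetical and chosen only for illustration.

```python
from collections import deque


def prune_to_log_relevant(call_graph, logging_methods):
    """Keep only methods that log, or that can reach a logging method.

    call_graph: dict mapping a caller name to a list of callee names.
    logging_methods: set of method names assumed to contain log statements.
    """
    # Build reverse edges (callee -> callers) so we can walk upward
    # from the logging methods to everything that can reach them.
    reverse = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            reverse.setdefault(callee, set()).add(caller)

    # Breadth-first search backward from the logging methods.
    relevant = set(logging_methods)
    queue = deque(logging_methods)
    while queue:
        node = queue.popleft()
        for caller in reverse.get(node, ()):
            if caller not in relevant:
                relevant.add(caller)
                queue.append(caller)

    # Drop irrelevant nodes and any edges that point to them.
    return {
        caller: [c for c in callees if c in relevant]
        for caller, callees in call_graph.items()
        if caller in relevant
    }


# Toy example (hypothetical method names): only the path through
# open_block leads to a log statement, so format_size and pad are pruned.
toy_graph = {
    "main": ["open_block", "format_size"],
    "open_block": ["log_info"],
    "format_size": ["pad"],
    "pad": [],
    "log_info": [],
}
pruned = prune_to_log_relevant(toy_graph, {"log_info"})
```

The backward search visits each edge at most once, so the pruning cost stays linear in the size of the call graph. A real pipeline would derive the graph and the set of logging call sites from static analysis of the target code base rather than from a hand-written dictionary.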
