LogPurge: Log Data Purification for Anomaly Detection via Rule-Enhanced Filtering
Shenglin Zhang, Ziang Chen, Zijing Que, Yilun Liu, Yongqian Sun, Sicheng Wei, Dan Pei, Hailin Li
TL;DR
LogPurge addresses the challenge of obtaining high-quality, contamination-free log data for unsupervised anomaly detection by introducing a rule-enhanced, cost-aware purification framework. It purifies data in two stages: region construction with LLM-based contamination judgment, followed by divide-and-conquer regional subdivision. Together, the two stages progressively eliminate anomalies while preserving the diversity of normal logs. Domain rules mined from validation feedback guide the LLM's reasoning, and a recursive subdivision step handles residual contamination. The result is high Subset Purity and Clean Retention Rates and substantially better performance from downstream lightweight detectors. Empirical results on two public datasets and a real industrial dataset show that LogPurge removes an average of 98.74% of anomalies while retaining 82.39% of normal samples, yielding large F1-score gains for lightweight detectors and orders-of-magnitude faster inference than direct LLM-based anomaly detection. This work demonstrates a practical, scalable path to semantically informed log data purification that improves the reliability and efficiency of anomaly detection in real-world systems.
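To make the two headline percentages concrete, the following minimal sketch computes the share of anomalies removed and the share of normal samples retained from ground-truth labels and a keep/drop decision per sequence. The function name `purification_rates` and its inputs are illustrative assumptions; the paper's exact definitions of Subset Purity and Clean Retention Rate are not given in this section and may differ.

```python
from typing import Sequence, Tuple


def purification_rates(is_anomaly: Sequence[bool],
                       kept: Sequence[bool]) -> Tuple[float, float]:
    """Return (anomaly_removal_rate, normal_retention_rate).

    Hypothetical helper: is_anomaly[i] is the ground-truth label of
    sequence i, kept[i] says whether purification kept that sequence.
    """
    if len(is_anomaly) != len(kept):
        raise ValueError("is_anomaly and kept must have the same length")

    n_anomalous = sum(1 for a in is_anomaly if a)
    n_normal = len(is_anomaly) - n_anomalous

    removed_anomalies = sum(1 for a, k in zip(is_anomaly, kept) if a and not k)
    kept_normals = sum(1 for a, k in zip(is_anomaly, kept) if not a and k)

    # With no anomalies (or no normals) present, treat the rate as perfect.
    removal = removed_anomalies / n_anomalous if n_anomalous else 1.0
    retention = kept_normals / n_normal if n_normal else 1.0
    return removal, retention


labels = [True, True, False, False, False, False]
decisions = [False, True, True, True, True, False]
removal_rate, retention_rate = purification_rates(labels, decisions)
print(removal_rate, retention_rate)
```

In this toy input, one of two anomalies is dropped and three of four normal sequences are kept.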
Abstract
Log anomaly detection, which is critical for identifying system failures and preempting security breaches, detects irregular patterns within large volumes of log data and impacts domains such as service reliability, performance optimization, and database log analysis. Modern log anomaly detection methods rely on training deep learning models on clean, anomaly-free log sequences. However, obtaining such clean log data requires costly and tedious human labeling, and existing automatic cleaning methods fail to fully integrate the specific characteristics and actual semantics of logs into their purification process. In this paper, we propose LogPurge, a cost-aware, rule-enhanced purification framework that automatically selects a sufficient subset of normal log sequences from contaminated log sequences to train an anomaly detection model. Our approach uses a two-stage filtering algorithm. In the first stage, we use a large language model (LLM) to remove clustered anomalous patterns, and we enhance system rules to improve the LLM's understanding of system logs. In the second stage, we use a divide-and-conquer strategy that decomposes the remaining contaminated regions into smaller subproblems, allowing each to be effectively purified through the first-stage procedure. Our experiments, conducted on two public datasets and one industrial dataset, show that our method removes an average of 98.74% of anomalies while retaining 82.39% of normal samples. Compared to the latest unsupervised log sample selection algorithms, our method achieves F1-score improvements of 35.7% and 84.11% on the public datasets and a 149.72% F1-score improvement on the private industrial dataset, demonstrating the effectiveness of our approach.
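The two-stage procedure described above can be illustrated with a short sketch. It is not the authors' implementation: the `judge` callable is a hypothetical stand-in for the rule-enhanced LLM judgment, the three verdict strings are assumed labels, and splitting a contaminated region in half is an assumed subdivision rule, since the abstract does not specify how regions are built or divided.

```python
from typing import Callable, List, Sequence

# Hypothetical judge interface: returns "clean", "anomalous", or "mixed".
Judge = Callable[[Sequence[str]], str]


def purify(region: Sequence[str], judge: Judge, min_size: int = 1) -> List[str]:
    """Keep sequences from regions judged clean; subdivide mixed regions."""
    if not region:
        return []
    min_size = max(1, min_size)  # guard against non-terminating splits

    verdict = judge(region)  # stage 1: region-level contamination judgment
    if verdict == "clean":
        return list(region)
    if verdict == "anomalous" or len(region) <= min_size:
        # Drop fully anomalous regions, and conservatively drop mixed
        # regions that are too small to subdivide further.
        return []

    # Stage 2: divide and conquer, re-applying stage 1 to each half.
    mid = len(region) // 2
    return purify(region[:mid], judge, min_size) + purify(region[mid:], judge, min_size)


def keyword_judge(region: Sequence[str]) -> str:
    """Toy judge for demonstration only: flags sequences containing 'ERROR'."""
    flags = ["ERROR" in seq for seq in region]
    if not any(flags):
        return "clean"
    if all(flags):
        return "anomalous"
    return "mixed"


logs = ["open file", "read block", "ERROR disk", "ERROR disk",
        "close file", "ERROR timeout", "write block", "flush"]
print(purify(logs, keyword_judge))
```

Judging whole regions first means one call can accept or reject many sequences, and the recursion spends further calls only on regions that remain contaminated, which is consistent with the cost-aware framing of the abstract.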
