What Information Contributes to Log-based Anomaly Detection? Insights from a Configurable Transformer-Based Approach
Xingfang Wu, Heng Li, Foutse Khomh
TL;DR
This work investigates which information in log data most effectively signals anomalies by evaluating a configurable Transformer-based architecture that can integrate semantic content, sequential order, and timestamps. The model processes log sequences of varying lengths, using Drain parsing and semantic embeddings, and explores two temporal encodings (RTEE and Time2Vec) alongside conventional positional encoding. Across four public datasets, the study finds that event occurrence and semantic information are the dominant signals, while sequential and temporal information often fail to improve detection performance and can even hinder learning on some datasets. The results expose the simplicity of current public datasets for log-based anomaly detection and call for richer datasets with diverse anomaly types to better evaluate advanced models and the contributions of their components. The proposed flexible framework serves as a tool for analyzing new datasets and guiding progress in log analytics for anomaly detection.
Abstract
Log data are generated from logging statements in the source code, providing insights into the execution processes of software applications and systems. State-of-the-art log-based anomaly detection approaches typically leverage deep learning models to capture the semantic or sequential information in the log data and detect anomalous runtime behaviors. However, the impact of each of these types of information is not clear. In addition, most existing approaches ignore the timestamps in log data, which can potentially provide fine-grained sequential and temporal information. In this work, we propose a configurable Transformer-based anomaly detection model that can capture the semantic, sequential, and temporal information in the log data and allows us to configure the different types of information as the model's features. Additionally, we train and evaluate the proposed model using log sequences of different lengths, thus overcoming the constraint of existing methods that rely on fixed-length or time-windowed log sequences as inputs. With the proposed model, we conduct a series of experiments with different combinations of input features to evaluate the roles of different types of information in anomaly detection. The model attains competitive and consistently stable performance relative to the baselines when presented with log sequences of varying lengths. The results indicate that the event occurrence information plays a key role in identifying anomalies, while the impact of the sequential and temporal information is not significant for anomaly detection on the studied public datasets. The findings also reveal the simplicity of the studied public datasets and highlight the importance of constructing new datasets that contain different types of anomalies to better evaluate the performance of anomaly detection models.
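
To make the distinction between event occurrence and sequential information concrete, the following minimal sketch builds an order-free occurrence (count) vector from a parsed log sequence. It is an illustration only, not the model described above: the event IDs (`E1`, `E2`, `E3`), the example sequences, and the helper `occurrence_vector` are all hypothetical.

```python
from collections import Counter


def occurrence_vector(event_ids, vocabulary):
    """Count how often each known event template appears, ignoring order."""
    counts = Counter(event_ids)
    return [counts[event] for event in vocabulary]


# Hypothetical event template IDs, as a log parser might assign them.
vocabulary = ["E1", "E2", "E3"]

seq_a = ["E1", "E2", "E2", "E3"]        # one ordering of the events
seq_b = ["E3", "E2", "E1", "E2"]        # same events, different order
seq_c = ["E1", "E2", "E2", "E2", "E3"]  # one extra occurrence of E2

vec_a = occurrence_vector(seq_a, vocabulary)
vec_b = occurrence_vector(seq_b, vocabulary)
vec_c = occurrence_vector(seq_c, vocabulary)

# Reordering leaves the occurrence vector unchanged, so a model fed only
# these counts cannot use sequential information.
print(vec_a == vec_b)
# A change in event frequency is visible without any ordering information.
print(vec_a != vec_c)
```

In this representation, `seq_a` and `seq_b` are indistinguishable, while `seq_c` differs only in its count of `E2`. A dataset whose anomalies mostly change which events occur, or how often, can therefore be handled without order or timing features, which is consistent with the finding reported above.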
