Detecting Anomalies in Software Execution Logs with Siamese Network

Shayan Hashemi, Mika Mäntylä

TL;DR

The paper tackles anomaly detection in software execution logs, addressing two obstacles: imbalanced data and evolving logs. It introduces a Siamese-network-based embedding of log sequences, with a pair-generation strategy that trains efficiently on both anomalous and normal data without heavy sampling. A classifier on top of the embedding achieves state-of-the-art results on the HDFS dataset (F1 ≈ 0.996), and a low-cost handcrafted variant plus a hybrid end-to-end model reduce engineering effort. The work also provides unsupervised run-time monitoring of log evolution via a fitness score, and a visualization of the embedding space that supports human oversight in production use. Overall, the approach offers high accuracy, robustness to log evolution, and practical deployment options for scalable log anomaly detection.

Abstract

Logs are semi-structured text files that represent software's execution paths and states during its run-time. Therefore, anomalies in software logs reflect anomalies in the software's execution path or state, and detecting them has become a notable concern in software engineering. We use LSTM like many prior works, and on top of LSTM, we propose a novel anomaly detection approach based on the Siamese network. This paper also provides an authentic validation of the approach on the Hadoop Distributed File System (HDFS) log dataset. To the best of our knowledge, the proposed approach outperforms other methods on the same dataset with an F1 score of 0.996, resulting in a new state-of-the-art performance on the dataset. Along with the primary method, we introduce a novel training pair generation algorithm that reduces the number of generated training pairs by a factor of 3000 while nearly maintaining the F1 score, with only a modest decay from 0.996 to 0.995. Additionally, we propose a hybrid model that combines the Siamese network with a traditional feedforward neural network to make end-to-end training possible, reducing the engineering effort of setting up a deep-learning-based log anomaly detector. Furthermore, we examine our method's robustness to log evolution by evaluating the model on synthetically evolved log sequences; we obtained an F1 score of 0.95 at a noise ratio of 20%. Finally, we dive deep into some of the side benefits of the Siamese network. Accordingly, we introduce a method of monitoring the evolution of logs at run-time without requiring labels. Additionally, we present a visualization technique that facilitates human administration of log anomaly detection.
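The abstract does not spell out the loss function or the pair-generation algorithm, so the following is only a minimal sketch under stated assumptions. It assumes a standard contrastive loss over pairs of embeddings (a common choice for Siamese networks, not confirmed as the authors' loss) and naive exhaustive pairing, to illustrate why the number of training pairs grows quadratically with the number of sequences and why reducing it matters. The function names, toy embeddings, and margin value are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, same_class, margin=1.0):
    """Standard contrastive loss for one pair (an assumption, not the paper's stated loss).

    Same-class pairs are penalized by their squared distance;
    different-class pairs are penalized only when closer than `margin`.
    """
    d = euclidean(u, v)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


def exhaustive_pairs(labels):
    """All unordered index pairs, each flagged with whether the two labels match.

    This is the naive baseline: n sequences yield n * (n - 1) / 2 pairs.
    """
    return [(i, j, labels[i] == labels[j])
            for i, j in combinations(range(len(labels)), 2)]


# Toy 2-D embeddings standing in for the encoder output (hypothetical values).
embeddings = [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [3.0, 4.0]]
labels = [0, 0, 0, 1]  # 0 = normal, 1 = anomaly

pairs = exhaustive_pairs(labels)
mean_loss = sum(
    contrastive_loss(embeddings[i], embeddings[j], same) for i, j, same in pairs
) / len(pairs)
print(len(pairs), mean_loss)
```

With exhaustive pairing, 10,000 labeled sequences would already produce about 50 million pairs, which is the kind of blow-up a pair-reduction algorithm is meant to avoid.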

Paper Structure

This paper contains 25 sections, 7 equations, 6 figures, 8 tables, 2 algorithms.

Figures (6)

  • Figure 1: System log anomaly detector's architecture as in LogRobust.
  • Figure 2: The Siamese network's architecture.
  • Figure 3: Overall view of data splitting and experiments.
  • Figure 4: The modified Siamese network's architecture for end-to-end training.
  • Figure 5: The drop in fitness score as the noise ratio increases. Note that positive scores arise because the scores are computed using a probability density function.
  • ...and 1 more figure