On Sequence-to-Sequence Models for Automated Log Parsing

Adam Sorrenti, Andriy Miranskyy

TL;DR

This paper systematically compares four sequence-to-sequence architectures for automated log parsing (Transformer, Mamba state-space model, and mono- and bidirectional LSTMs) across diverse log formats and data conditions. Using the LogHub-2k and HTTPd-parse benchmarks and Levenshtein-based metrics, it finds that the Transformer is generally the most accurate, while Mamba delivers competitive accuracy at substantially lower compute cost. Character-level tokenization generally improves parsing quality, and sequence length has limited practical impact on Transformer accuracy. The results expose a compute-accuracy trade-off: Mamba suits cost-constrained environments, whereas Transformers are more robust under distribution shift. The paper also identifies future directions, including subword tokenization and hybrid architectures, to improve generalization across heterogeneous real-world log streams. Together, the findings clarify how representation, sequence length, and sample efficiency influence log parsing, and they give researchers and practitioners guidance on choosing a model under data and compute constraints.

Abstract

Log parsing is a critical standard operating procedure in software systems, enabling monitoring, anomaly detection, and failure diagnosis. However, automated log parsing remains challenging due to heterogeneous log formats, distribution shifts between training and deployment data, and the brittleness of rule-based approaches. This study aims to systematically evaluate how sequence modelling architecture, representation choice, sequence length, and training data availability influence automated log parsing performance and computational cost. We conduct a controlled empirical study comparing four sequence modelling architectures: Transformer, Mamba state-space, monodirectional LSTM, and bidirectional LSTM models. In total, 396 models are trained across multiple dataset configurations and evaluated using relative Levenshtein edit distance with statistical significance testing. Transformer achieves the lowest mean relative edit distance (0.111), followed by Mamba (0.145), mono-LSTM (0.186), and bi-LSTM (0.265), where lower values are better. Mamba provides competitive accuracy with substantially lower computational cost. Character-level tokenization generally improves performance, sequence length has negligible practical impact on Transformer accuracy, and both Mamba and Transformer demonstrate stronger sample efficiency than recurrent models. Overall, Transformers reduce parsing error by 23.4%, a figure consistent with the gap between the Transformer (0.111) and the next-best model, Mamba (0.145), while Mamba is a strong alternative under data or compute constraints. These results also clarify the roles of representation choice, sequence length, and sample efficiency, providing practical guidance for researchers and practitioners.
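The abstract's metric and its headline percentage can be made concrete with a minimal sketch. The function names below are illustrative, and the normalization (dividing the Levenshtein distance by the length of the reference sequence) is an assumption, since this excerpt does not state the paper's exact definition of "relative" edit distance. The last helper only checks that the reported 23.4% matches the relative gap between the two reported means, 0.111 and 0.145.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def relative_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance scaled by reference length (assumed normalization)."""
    return levenshtein(predicted, reference) / max(len(reference), 1)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the error of `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    print(relative_edit_distance("abxd", "abcd"))
    # Mean errors reported in the abstract: Mamba 0.145, Transformer 0.111.
    print(round(relative_reduction(0.145, 0.111) * 100, 1))
```

Under this normalization, a score of 0 means the predicted label sequence equals the reference exactly, and the score grows with each character that has to be inserted, deleted, or substituted.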

Paper Structure (67 sections, 14 figures, 64 tables)

This paper contains 67 sections, 14 figures, 64 tables.

Figures (14)

  • Figure 1: A sample sequence-to-sequence mapping of an Apache web server log. The first line contains the raw log message. The second line contains the field type to which a given input character is mapped.
  • Figure 2: Box-plots of the relative edit distance of $M_T$ by sequence length and validation dataset.
  • Figure 3: Box-plots of the relative edit distance by tokenization method and training dataset for each model architecture.
  • Figure 4: Box-plots of the relative edit distance of $M_T$ by sequence length.
  • Figure 5: Box-plots of the relative edit distance of $M_T$ by sequence length and training dataset.
  • ...and 9 more figures
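
The character-to-field mapping described in the Figure 1 caption can be illustrated with a small sketch. This is not the paper's method: the paper trains sequence models to produce the mapping, whereas the code below derives a target label string with a hand-written regular expression for Apache's Common Log Format. The label symbols (`h`, `l`, `u`, `t`, `r`, `s`, `b`, and `_` for separators) and the sample log line are hypothetical choices made for illustration.

```python
import re

# Common Log Format: host ident user [time] "request" status size
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

# One label symbol per field (hypothetical alphabet).
LABELS = {
    "host": "h", "ident": "l", "user": "u", "time": "t",
    "request": "r", "status": "s", "size": "b",
}


def char_labels(line: str) -> str:
    """Return a string of the same length as `line`, one label per character."""
    match = CLF.fullmatch(line)
    if match is None:
        raise ValueError("line is not in Common Log Format")
    labels = ["_"] * len(line)  # separators, brackets, and quotes stay "_"
    for field, symbol in LABELS.items():
        start, end = match.span(field)
        for i in range(start, end):
            labels[i] = symbol
    return "".join(labels)


if __name__ == "__main__":
    sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /a HTTP/1.0" 200 2326'
    print(sample)
    print(char_labels(sample))
```

Printing the two strings one above the other reproduces the two-line layout the caption describes: the raw message on the first line and a per-character field type on the second. Comparing a model's predicted label string against such a reference is where an edit-distance metric would apply.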