Log Parsing Evaluation in the Era of Modern Software Systems

Stefan Petrescu, Floris den Hengst, Alexandru Uta, Jan S. Rellermeyer

TL;DR

This work scrutinizes how log parsing is evaluated in academia, revealing that standard parsing accuracy metrics misrepresent real-template extraction and downstream usefulness, especially on heterogeneous, industry-like data. It shows that robustness depends heavily on dataset characteristics and that public benchmarks fail to capture production-scale complexity, including microservices and containers. To address this, the authors introduce Logchimera, a tool that estimates log heterogeneity and synthetically increases it via mixing and fuzzing, producing industry-resembling datasets and enabling realistic benchmarking. They provide two metrics, log template accuracy and edit-distance, along with public datasets and open-source tooling to bridge the gap between research and production, aiming to improve the reliability of automated log analysis in real-world systems.

Abstract

Due to the complexity and size of modern software systems, the amount of logs generated is tremendous. Hence, it is infeasible to manually investigate these data in a reasonable time, so log analysis must be automated to derive insights about the functioning of the systems. Motivated by an industry use case, we zoom in on one integral part of automated log analysis, log parsing, which is the prerequisite to deriving any insights from logs. Our investigation reveals problematic aspects within the log parsing field, particularly its inefficiency in handling heterogeneous real-world logs. We show this by assessing the 14 most-recognized log parsing approaches in the literature using (i) nine publicly available datasets, (ii) one dataset composed of combined publicly available data, and (iii) one dataset generated within the infrastructure of a large bank. Subsequently, toward improving log parsing robustness in real-world production scenarios, we propose a tool, Logchimera, that enables estimating log parsing performance in industry contexts by generating synthetic log data that resemble industry logs. Our contributions serve as a foundation to consolidate past research efforts, facilitate future research advancements, and establish a strong link between research and industry log parsing.

Paper Structure

This paper contains 14 sections, 1 equation, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Example of log parsing process. The discovered constant parts represent the log template, whereas variables are replaced by generic tokens: '<*>'.
  • Figure 2: (a) Clustering of offline approaches. (b) Clustering of online approaches.
  • Figure 3: Difference between parsing accuracy (considered by He et al. 10.1145/3460345) and log template accuracy. Even though Template 2 and Template 3 do not match Gdth Label 2 and Gdth Label 3 respectively, under the most prevalent metric in the field (parsing accuracy), the templates are considered to be extracted correctly, with an accuracy of 100%.
  • Figure 4: (a) Unique number of characters versus unique number of words. (b) Unique number of log lines' character length versus unique number of words. (c) Unique number of log lines' character length versus unique number of characters.
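
The gap described in the Figure 3 caption can be illustrated with a small Python sketch. It is not the authors' implementation. `parsing_accuracy` follows the common grouping-based reading of the metric, where a line is correct when it is grouped with the same lines as in the ground truth. `template_match_accuracy` is a hypothetical stand-in that assumes a template-level metric requires an exact string match; the paper's exact definition of log template accuracy may differ. The log templates are invented for illustration.

```python
from collections import defaultdict


def group_by_label(labels):
    """Map each label to the set of line indices that carry it."""
    groups = defaultdict(set)
    for index, label in enumerate(labels):
        groups[label].add(index)
    return groups


def parsing_accuracy(predicted, truth):
    """Grouping-based accuracy: a line is correct when the set of lines
    sharing its predicted template equals the set of lines sharing its
    ground-truth template. The template text is never compared."""
    predicted_groups = group_by_label(predicted)
    truth_groups = group_by_label(truth)
    correct = sum(
        1
        for index in range(len(truth))
        if predicted_groups[predicted[index]] == truth_groups[truth[index]]
    )
    return correct / len(truth)


def template_match_accuracy(predicted, truth):
    """Assumed template-level metric (hypothetical): fraction of lines
    whose predicted template string equals the ground-truth template."""
    correct = sum(1 for p, t in zip(predicted, truth) if p == t)
    return correct / len(truth)


# Invented example: one ground-truth template and one predicted template per log line.
truth = [
    "Connection from <*> closed",
    "Connection from <*> closed",
    "User <*> logged in",
    "Disk usage at <*>",
]
predicted = [
    "Connection from <*> closed",
    "Connection from <*> closed",
    "User <*> logged <*>",  # constant token wrongly turned into a variable
    "Disk <*> at <*>",      # constant token wrongly turned into a variable
]

print(parsing_accuracy(predicted, truth))         # 1.0
print(template_match_accuracy(predicted, truth))  # 0.5
```

Because the predicted templates split the lines into the same groups as the ground truth, the grouping-based score is perfect even though two of the three distinct templates are wrong; the string-level check exposes the mismatch.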