Preprocessing is All You Need: Boosting the Performance of Log Parsers With a General Preprocessing Framework

Qiaolin Qin, Roozbeh Aghili, Heng Li, Ettore Merlo

TL;DR

This work reframes log parsing by elevating preprocessing from an ad hoc step to a core component that substantially boosts statistic-based parsers. By building a general, regex-driven preprocessing framework grounded in Loghub data, the authors demonstrate large gains in template-level metrics (e.g., Drain's FTA improving by 108.9%) and competitive performance relative to semantic-based parsers. They characterize the variable types that existing preprocessing fails to match, propose a prioritized set of generalizable patterns, and show that combining preprocessing with parsing yields robust improvements across log subgroups and template complexities. The results offer practical guidance for practitioners, and the authors provide replication resources to ease adoption and further research in preprocessing-driven log parsing.

Abstract

Log parsing has been a long-studied area in software engineering due to its importance in identifying dynamic variables and constructing log templates. Prior work has proposed many statistic-based log parsers (e.g., Drain), which are highly efficient; unfortunately, their parsing performance falls short of semantic-based log parsers, which require labeling and more computational resources. Meanwhile, we noticed that previous studies mainly focused on parsing and often treated preprocessing as an ad hoc step (e.g., masking numbers). However, we argue that both preprocessing and parsing are essential for log parsers to identify dynamic variables: a lack of understanding of preprocessing may hinder the optimal use of parsers and future research. Therefore, our work studied existing log preprocessing approaches based on Loghub, a popular log parsing benchmark. We developed a general preprocessing framework based on our findings and evaluated its impact on existing parsers. Our experiments show that the preprocessing framework significantly boosts the performance of four state-of-the-art statistic-based parsers. Drain, the best statistic-based parser, obtained improvements across all four parsing metrics (e.g., the F1 score of template accuracy, FTA, increased by 108.9%). Compared to semantic-based parsers, it achieved a 28.3% improvement in grouping accuracy (GA), 38.1% in the F1 score of grouping accuracy (FGA), and 18.6% in FTA. Our work pioneers log preprocessing and provides a generalizable framework to enhance log parsing.
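
To make the idea of regex-driven preprocessing concrete, the following is a minimal sketch of masking dynamic variables before a log line reaches a parser such as Drain. It is not the authors' framework: the pattern list, the placeholder tokens (`<IP>`, `<HEX>`, `<NUM>`), and the function name `preprocess` are all hypothetical and chosen only for illustration. The one design point it borrows from the text is that patterns are applied in a fixed priority order, with more specific patterns first.

```python
import re

# Hypothetical, illustrative patterns (NOT the paper's pattern set).
# Order matters: more specific patterns run before more general ones,
# so an IP address is masked as a whole rather than as four numbers.
PATTERNS = [
    # IPv4 address with an optional port, e.g. 10.0.0.1:8080
    ("<IP>", re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b")),
    # Hexadecimal literal, e.g. 0x1f
    ("<HEX>", re.compile(r"\b0x[0-9a-fA-F]+\b")),
    # Standalone integer or decimal; digits glued to letters or
    # underscores (e.g. blk_123) are deliberately left untouched here.
    ("<NUM>", re.compile(r"(?<![\w.])-?\d+(?:\.\d+)?(?![\w.])")),
]


def preprocess(line: str) -> str:
    """Replace substrings matching each pattern with its placeholder."""
    for placeholder, pattern in PATTERNS:
        line = pattern.sub(placeholder, line)
    return line


if __name__ == "__main__":
    print(preprocess("Connection from 10.0.0.1:8080 closed after 35 ms"))
    print(preprocess("read 0x1f bytes from blk_123"))
```

After masking, log lines that differ only in their variable values collapse to the same string, which is what lets a statistic-based parser group them under one template. Which patterns to include, and in what order, is exactly the kind of decision the paper argues should be studied rather than left ad hoc.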

Paper Structure

This paper contains 24 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: The overview of our study design. D-K stands for domain knowledge.
  • Figure 2: The evaluation results of the four statistic-based parsers. The blue boxes indicate the parsers with the original preprocessing function, while the yellow boxes show the results of parsers with the new preprocessing framework. The red lines show the medians, and the green arrows indicate the means.
  • Figure 3: The number of hours required for parsing each log file using different parsers.
  • Figure 4: The average evaluation results of log parsers on logs with different frequencies (i.e., the most frequent 10% and the least frequent 10%). The red dotted lines illustrate the original results obtained with the previous preprocessing function.
  • Figure 5: The average evaluation results of log parsers on logs with different numbers of variables. The red dotted lines illustrate the original results obtained with the previous preprocessing function.