System Log Parsing with Large Language Models: A Review

Viktor Beck, Max Landauer, Markus Wurzenberger, Florian Skopik, Andreas Rauber

TL;DR

This paper systematically reviews the landscape of large language model–based log parsing, synthesizing 29 methods and benchmarking seven open-source approaches on public datasets. It presents a unified process pipeline, clarifies terminology across General Properties, Processing Steps, and Reproducibility, and evaluates both effectiveness and efficiency with a transparent benchmark. Key findings show that while LLM-based parsers like LogBatcher and LILAC can outperform traditional baselines in some settings, reproducibility and comparability remain major challenges due to heterogeneous datasets, metrics, and reporting. The work highlights effective techniques, namely in-context learning (ICL), retrieval-augmented generation (RAG), caching, and template revision, and argues for standardized benchmarks and reporting practices to advance practical, reproducible log parsing with LLMs.

Abstract

Log data provides crucial insights for tasks like monitoring, root cause analysis, and anomaly detection. Due to the vast volume of logs, automated log parsing is essential to transform semi-structured log messages into structured representations. Recent advances in large language models (LLMs) have introduced the new research field of LLM-based log parsing. Despite promising results, there is no structured overview of the approaches in this relatively new research field, whose earliest advances were published in late 2023. This work systematically reviews 29 LLM-based log parsing methods. We benchmark seven of them on public datasets and critically assess their comparability and the reproducibility of their reported results. Our findings summarize the advances of this new research field, with insights on how to report results, which datasets, metrics, and terminology to use, and which inconsistencies to avoid. Code and results are made publicly available for transparency.
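To make the parsing task concrete, the following minimal Python sketch turns a semi-structured log message into a template plus a list of extracted parameters. It is not one of the reviewed LLM-based methods and is not taken from the paper: it is a naive regex heuristic, and the function name `parse_log_message`, the pattern `VARIABLE_PATTERN`, and the sample message are assumptions made purely for illustration. The `<*>` wildcard follows a placeholder convention commonly used for log templates.

```python
import re

# Hypothetical pattern for the variable parts of a message (illustration only):
# IPv4 addresses are tried first, then standalone integers.
VARIABLE_PATTERN = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b|\b\d+\b")


def parse_log_message(message: str) -> tuple[str, list[str]]:
    """Split one log message into a static template and its variable parameters."""
    parameters: list[str] = []

    def mask(match: re.Match) -> str:
        # Record the variable value and replace it with a wildcard.
        parameters.append(match.group(0))
        return "<*>"

    template = VARIABLE_PATTERN.sub(mask, message)
    return template, parameters


if __name__ == "__main__":
    template, params = parse_log_message(
        "Connection from 10.0.0.5 closed after 42 seconds"
    )
    print(template)
    print(params)
```

A fixed regex like this breaks as soon as variable parts are not numeric (user names, file paths, free text), which is the kind of limitation that motivates learned and LLM-based parsers.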

Paper Structure

This paper contains 53 sections, 9 figures, 5 tables.

Figures (9)

  • Figure 1: A simple example of log parsing, from logging statement to log file to parsed log.
  • Figure 2: LLM-based parsing process pipeline. The dashed arrows and boxes represent optional components while the continuous ones represent essential components. The arrows pointing from the LLM to certain process elements describe where the LLM can be applied.
  • Figure 3: Performance of the baseline parsers on the corrected LogHub dataset including Audit.
  • Figure 4: Performance of the selected parsers on the corrected LogHub datasets including Audit.
  • Figure 5: Averaged performance of LLM-based parsers for CodeLlama, GPT-3.5 and DeepSeek R1.
  • ...and 4 more figures