A Comparative Study on Large Language Models for Log Parsing

Merve Astekin; Max Hort; Leon Moonen

TL;DR

The paper benchmarks six large language models, including two paid and four free/open models, on log parsing across 1,354 LogHub templates using zero-shot and few-shot prompts. It evaluates parsing accuracy ($PA$), edit distance ($ED$), and longest common subsequence ($LCS$), plus normalized variants, to assess both correctness and template similarity. Key findings show CodeLlama delivering the best $PA$ and $ED$, while GPT-3.5 and Claude 2.1 perform best on $LCS$, revealing complementary strengths across models. The results demonstrate that free, code-specialized LLMs can compete with paid models for log parsing, with practical implications for cost, privacy, and accessibility, and underscore the importance of metric choice and output usability in evaluation. The work provides guidance for practitioners on selecting models and prompting strategies and suggests avenues for future research in hybrid model usage and prompt design, with data and artifacts made available for replication.

Abstract

Background: Log messages provide valuable information about the status of software systems. This information is provided in an unstructured fashion, and automated approaches are applied to extract relevant parameters. To ease this process, log parsing can be applied, which transforms log messages into structured log templates. Recent advances in language models have led to several studies that apply ChatGPT to the task of log parsing with promising results. However, the performance of other state-of-the-art large language models (LLMs) on the log parsing task remains unclear. Aims: In this study, we investigate the current capability of state-of-the-art LLMs to perform log parsing. Method: We select six recent LLMs, comprising two paid proprietary models (GPT-3.5, Claude 2.1) and four free-to-use open models, and compare their performance on system logs obtained from a selection of mature open-source projects. We design two different prompting approaches and apply the LLMs to 1,354 log templates across 16 different projects. We evaluate their effectiveness in terms of the number of correctly identified templates and the syntactic similarity between the generated templates and the ground truth. Results: We found that free-to-use models are able to compete with paid models, with CodeLlama extracting 10% more log templates correctly than GPT-3.5. Moreover, we provide qualitative insights into the usability of language models (e.g., how easy it is to use their responses). Conclusions: Our results reveal that some of the smaller, free-to-use LLMs, especially code-specialized models, can considerably assist log parsing compared to their paid proprietary competitors.
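To make the two evaluation aspects concrete (correctly identified templates and syntactic similarity to the ground truth), the following is a minimal Python sketch of the three metric families named in the TL;DR. It is not the authors' implementation. It assumes that parsing accuracy is the fraction of generated templates that exactly match the ground truth, that edit distance and longest common subsequence are computed at the character level, and that normalization divides by the longer string's length. All function names and the sample templates are hypothetical.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (assumed granularity)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def parsing_accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of predicted templates that exactly match the ground truth."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        return 0.0
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Illustrative templates only; "<*>" marks a variable part of the message.
truth = ["Connection from <*> closed", "Failed to open <*>"]
predicted = ["Connection from <*> closed", "Failed to open file <*>"]
accuracy = parsing_accuracy(predicted, truth)
distances = [edit_distance(p, t) for p, t in zip(predicted, truth)]
```

The second predicted template counts as a miss under exact matching even though it differs from the ground truth by only five characters, which illustrates why a model can rank differently on exact-match and similarity metrics.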

Paper Structure (18 sections, 2 equations, 5 figures, 5 tables)

Introduction
Related Work
Log Parsing with Large Language Models
Evaluating Log Parsing
Methodology
Experimental Design
Research Questions
Dataset
Large Language Models
Evaluation Metrics
Threats to Validity
Results and Discussion
RQ1: Performance Comparison
RQ2: Impact of Metrics
RQ3: Usability of Responses
...and 3 more sections

Figures (5)

Figure 1: An example workflow of log parsing.
Figure 2: Overview of the experimental method.
Figure 3: The prompt prefix. Few-shot examples are highlighted in green. The log messages are inserted in <MSG>...</MSG> at the bottom of the prompt.
Figure 4: Ranking based on metrics.
Figure 5: Comparison of different aspects based on ranking with Pearson correlation coefficient. Colour and shape show the different LLMs.
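The prompt layout described in the caption of Figure 3 (a prefix, optional few-shot examples, and the log message wrapped in <MSG>...</MSG> at the bottom) can be sketched as follows. This is an illustrative reconstruction, not the paper's prompt: the instruction wording, the example formatting, and the function name `build_prompt` are all assumptions. Only the <MSG>...</MSG> wrapper and its position at the end come from the caption.

```python
from typing import Sequence, Tuple


def build_prompt(instruction: str,
                 log_message: str,
                 few_shot: Sequence[Tuple[str, str]] = ()) -> str:
    """Assemble a zero-shot or few-shot prompt for one log message.

    `few_shot` holds (log message, template) pairs; leaving it empty
    yields the zero-shot variant. The example layout is hypothetical.
    """
    parts = [instruction]
    for example_log, example_template in few_shot:
        parts.append(f"Log: {example_log}\nTemplate: {example_template}")
    # The message to parse goes last, wrapped in <MSG> tags as in Figure 3.
    parts.append(f"<MSG>{log_message}</MSG>")
    return "\n\n".join(parts)


# Hypothetical instruction text and example pair, for illustration only.
instruction = "Extract the log template, replacing variable parts with <*>."
examples = [("Connection from 10.0.0.1 closed", "Connection from <*> closed")]

zero_shot_prompt = build_prompt(instruction, "Failed to open /tmp/a.log")
few_shot_prompt = build_prompt(instruction, "Failed to open /tmp/a.log", examples)
```

Keeping the message in a fixed, tagged position at the end means the same prefix can be reused across all log messages, with only the final block changing between requests.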
