REFLEX: Reference-Free Evaluation of Log Summarization via Large Language Model Judgment

Priyanka Mudgal

TL;DR

REFLEX addresses the challenge of evaluating log summarization without gold references by using large language models as reference-free judges. It introduces a modular pipeline with preprocessing, LLM-based summarization, and an embedding-based Evaluation Engine that computes semantic similarity, enabling stable, interpretable, and discriminative assessments across diverse log datasets. Empirical results show REFLEX correlates with human preferences and captures quality dimensions such as relevance, informativeness, and fluency, often outperforming surface-based metrics like ROUGE in the log domain. The framework offers a scalable, reproducible protocol suitable for real-world deployment and research, with potential extensions to hybrid parsing, temporal context, and real-time monitoring scenarios.

Abstract

Evaluating log summarization systems is challenging due to the lack of high-quality reference summaries and the limitations of existing metrics like ROUGE and BLEU, which depend on surface-level lexical overlap. We introduce REFLEX, a reference-free evaluation metric for log summarization based on large language model (LLM) judgment. REFLEX uses LLMs as zero-shot evaluators to assess summary quality along dimensions such as relevance, informativeness, and coherence, without requiring gold-standard references or human annotations. We show that REFLEX produces stable, interpretable, and fine-grained evaluations across multiple log summarization datasets, and that it distinguishes model outputs more effectively than traditional metrics. REFLEX provides a scalable alternative for evaluating log summaries in real-world settings where reference data is scarce or unavailable.
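
The TL;DR mentions an embedding-based Evaluation Engine that computes semantic similarity. The text here does not give its details, so the following is only a minimal sketch of the general idea of a reference-free score, under two assumptions: that the similarity is taken between the source log and the generated summary, and that it is a cosine similarity. The function names (`toy_embed`, `cosine_similarity`, `reference_free_score`) are hypothetical, and `toy_embed` is a bag-of-words placeholder for a real dense embedding model. It is lexical, so it does not capture the semantic matching the paper aims for.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase token counts.

    A real pipeline would return a dense vector from a sentence-embedding
    model; this placeholder only keeps the example self-contained.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def reference_free_score(log_text: str, summary: str) -> float:
    """Score a summary against its own source log (assumed setup).

    No gold reference summary is involved, which is the point of a
    reference-free protocol.
    """
    return cosine_similarity(toy_embed(log_text), toy_embed(summary))


# Illustrative, made-up log lines in the style of HDFS block messages.
log_text = (
    "Receiving block blk_1 src 10.0.0.1 dest 10.0.0.2. "
    "PacketResponder 1 for block blk_1 terminating."
)
on_topic = "Block blk_1 was received and its PacketResponder is terminating."
off_topic = "User login failed because of a wrong password."

score_on_topic = reference_free_score(log_text, on_topic)
score_off_topic = reference_free_score(log_text, off_topic)
```

In this sketch the on-topic summary scores higher than the off-topic one because it shares content with the log. Swapping `toy_embed` for a dense embedding model would leave the rest of the scoring code unchanged, apart from computing the cosine over dense vectors.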

Paper Structure

This paper contains 24 sections, 5 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: REFLEX uses an LLM to generate summaries from logs and evaluates them automatically, without requiring human-written references, similar to how experts judge log readability.
  • Figure 2: HDFS block update log messages and provided summary 10017337.
  • Figure 3: Comparison of similarity and ROUGE scores across log types for three REFLEX variants.