ECLIPSE: Semantic Entropy-LCS for Cross-Lingual Industrial Log Parsing

Wei Zhang, Xianfu Cheng, Yi Zhang, Jian Yang, Hongcheng Guo, Zhoujun Li, Xiaolin Yin, Xiangyuan Guan, Xu Shi, Liangfan Zheng, Bo Zhang

TL;DR

ECLIPSE addresses the challenge of industrial log parsing at scale and across languages by coupling LLM-guided cross-lingual semantic understanding with fast vector-based retrieval (Faiss) and an information-theoretic matching mechanism (Entropy-LCS). It builds a dynamic template library keyed by semantic keywords, recalls candidate templates with k-nearest-neighbor search, and uses Entropy-LCS to select the best match while updating the dictionary in real time. The authors introduce ECLIPSE-Bench, a bilingual (Chinese and English) industrial log parsing benchmark with around 102 million logs and 700 templates, enabling evaluation of cross-language scenarios. Across public LogHub data and ECLIPSE-Bench, ECLIPSE achieves state-of-the-art or near-state-of-the-art performance while delivering significant efficiency advantages at large template volumes, demonstrating robust applicability to real-world industrial settings.
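The summary above does not give the Entropy-LCS formula, so the following is only one plausible reading, not the authors' method: weight each token by its self-information (rare tokens count more) and score a candidate template by the information-weighted longest common subsequence it shares with the log. The function names (`token_information`, `weighted_lcs`, `best_template`), the `<*>` wildcard convention, and the normalization by template weight are all assumptions made for illustration.

```python
import math
from collections import Counter


def token_information(templates):
    """Self-information -log2 p(token), estimated from token counts in the templates."""
    counts = Counter(tok for t in templates for tok in t.split())
    total = sum(counts.values())
    return {tok: -math.log2(c / total) for tok, c in counts.items()}


def weighted_lcs(a, b, weight):
    """Maximum total weight of a common subsequence of token lists a and b.

    Assumes weight(token) is non-negative, so taking a match is never worse.
    """
    dp = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + weight(x)
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def best_template(log, templates, default_info=1.0):
    """Pick the template whose constant tokens are best covered by the log.

    The score is the weighted LCS divided by the template's total weight
    (an assumed normalization, chosen only to keep scores in [0, 1]).
    """
    info = token_information(templates)

    def weight(tok):
        return info.get(tok, default_info)

    log_toks = log.split()

    def score(template):
        toks = [tok for tok in template.split() if tok != "<*>"]
        denom = sum(weight(tok) for tok in toks) or 1.0
        return weighted_lcs(log_toks, toks, weight) / denom

    return max(templates, key=score)


if __name__ == "__main__":
    templates = [
        "Connection from <*> closed",
        "Connection to <*> failed",
        "User <*> logged in",
    ]
    print(best_template("Connection from 10.0.0.5 closed", templates))
```

Because LCS respects token order, a sketch like this distinguishes templates that share tokens in different orders and tolerates logs of different lengths from the same template. How ECLIPSE actually defines and normalizes the entropy term is left to the paper.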

Abstract

Log parsing, a vital task for interpreting the vast and complex data produced within software architectures, faces significant challenges in the transition from academic benchmarks to the industrial domain. Existing log parsers, while highly effective on standardized public datasets, struggle to maintain performance and efficiency when confronted with the sheer scale and diversity of real-world industrial logs. These challenges are two-fold: 1) Massive log templates: the performance and efficiency of most existing parsers drop significantly as logs grow in quantity and vary in length; 2) Complex and changeable semantics: traditional template-matching algorithms cannot accurately match the log templates of complicated industrial logs because they cannot utilize cross-language logs with similar semantics. To address these issues, we propose ECLIPSE, Enhanced Cross-Lingual Industrial log Parsing with Semantic Entropy-LCS, which exploits cross-language logs to parse industrial logs robustly. On the one hand, it integrates two efficient data-driven template-matching algorithms and Faiss indexing. On the other hand, it uses the powerful semantic understanding ability of a Large Language Model (LLM) to accurately extract the semantics of log keywords and effectively reduce the retrieval space. Notably, we launch a Chinese and English cross-platform industrial log parsing benchmark, ECLIPSE-BENCH, to evaluate the performance of mainstream parsers in industrial scenarios. Our experimental results across public benchmarks and ECLIPSE-BENCH underscore the superior performance and robustness of our proposed ECLIPSE. ECLIPSE both delivers state-of-the-art performance when compared to strong baselines and preserves a significant edge in processing efficiency.

Paper Structure (30 sections, 1 equation, 10 figures, 8 tables)

This paper contains 30 sections, 1 equation, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Diagram of the log structuring process. The semi-structured logs are generated by log statement code and then parsed into the structured logs by algorithms. The log template represents the part of the keywords that are not changed during the log parsing process.
  • Figure 2: Example of various lengths of two logs from the same template.
  • Figure 3: Example of templates with similar tokens but different orders.
  • Figure 4: Example of three similar log templates in MySQL error log.
  • Figure 5: The framework of ECLIPSE, a five-stage, two-path cross-lingual online log parser. Driven by an LLM, ECLIPSE constructs a dynamic dictionary that maps semantic keywords to log templates. Faiss recalls the K-nearest-neighbor templates from the dynamic dictionary. The Entropy-LCS method then selects the log template from among these candidates, and the resulting templates update the dynamic dictionary in real time.
  • ...and 5 more figures
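
The recall, match, and update loop described in the Figure 5 caption can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the ECLIPSE implementation: `extract_keywords` is a hypothetical stand-in for the LLM keyword step (it simply drops tokens containing digits), brute-force cosine similarity over keyword counts stands in for Faiss, a plain LCS ratio stands in for Entropy-LCS, and the class name `TemplateStore`, the values of `k` and `threshold`, and the digit-based wildcard rule are invented for the example.

```python
import math
import re
from collections import Counter


def extract_keywords(log):
    """Hypothetical stand-in for the LLM step: keep tokens without digits."""
    return [t for t in log.split() if not re.search(r"\d", t)]


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


class TemplateStore:
    """Toy dynamic dictionary: keyword vector -> template, updated online."""

    def __init__(self, k=3, threshold=0.6):
        self.k = k
        self.threshold = threshold
        self.entries = []  # list of (keyword Counter, template string)

    def parse(self, log):
        keywords = extract_keywords(log)
        query = Counter(keywords)
        # Recall: brute-force top-k by cosine similarity (a vector index
        # such as Faiss would take over this step at scale).
        recalled = sorted(
            self.entries, key=lambda e: cosine(query, e[0]), reverse=True
        )[: self.k]
        # Match: plain LCS ratio between log keywords and template keywords.
        best, best_score = None, 0.0
        for _, template in recalled:
            tmpl_kw = [t for t in template.split() if t != "<*>"]
            score = lcs_len(keywords, tmpl_kw) / max(len(keywords), len(tmpl_kw), 1)
            if score > best_score:
                best, best_score = template, score
        if best is not None and best_score >= self.threshold:
            return best
        # Update: no good match, so register a new template immediately.
        template = " ".join("<*>" if re.search(r"\d", t) else t for t in log.split())
        self.entries.append((query, template))
        return template


if __name__ == "__main__":
    store = TemplateStore()
    for line in [
        "Connection from 10.0.0.5 closed",
        "Connection from 10.0.0.9 closed",
        "User 42 logged in",
    ]:
        print(store.parse(line))
```

Restricting the expensive sequence match to the k recalled candidates is what keeps per-log cost roughly flat as the template library grows. The cross-lingual part of the pipeline is not modeled here.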