AL-Bench: A Benchmark for Automatic Logging

Boyin Tan, Junjielong Xu, Zhouruixing Zhu, Pinjia He

TL;DR

This work introduces AL-Bench, a comprehensive benchmark for automatic logging that combines a large, real-world dataset with a dual static-dynamic evaluation framework. It shows that state-of-the-art logging tools exhibit substantial static accuracy drops and limited runtime log fidelity, with notable compilation failures, even on top-performing systems. By evaluating end-to-end tools across 10 popular projects and making data and code public, AL-Bench highlights critical gaps between reported results and real-world performance, and provides a foundation for fair, reproducible advancement in automatic logging. The benchmark emphasizes the need for context-aware generation that accounts for code execution paths and runtime log semantics to achieve practical logging quality.

Abstract

Logging, the practice of inserting log statements into source code, is critical for improving software reliability. Recently, language model-based techniques have been developed to automate log statement generation based on input code. While these tools show promising results in prior studies, their results cannot be compared fairly because each study uses its own ad hoc dataset. In addition, existing evaluation approaches depend exclusively on code similarity metrics and therefore fail to capture the impact of code differences on runtime logging behavior: even minor code modifications can make a program uncompilable or cause substantial discrepancies in the semantics of its log output. To enhance the consistency and reproducibility of logging evaluation, we introduce AL-Bench, a comprehensive benchmark designed specifically for automatic logging tools. AL-Bench includes a large-scale, high-quality, diverse dataset collected from 10 widely recognized projects with varying logging requirements. Moreover, it introduces a novel dynamic evaluation methodology that provides a runtime perspective on logging quality, in addition to the traditional static evaluation at the source code level. Specifically, AL-Bench not only evaluates the similarity between the oracle and predicted log statements in source code, but also evaluates the difference between the log files printed by the two log statements at runtime. AL-Bench reveals significant limitations in existing static evaluation: all logging tools show average accuracy drops of 37.49%, 23.43%, and 15.80% in predicting log position, level, and message, respectively, compared to their reported results. Furthermore, with dynamic evaluation, AL-Bench reveals that 20.1%-83.6% of the generated log statements fail to compile. Moreover, the best-performing tool achieves only 21.32% cosine similarity between the log files of the oracle and generated log statements.
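To make the log-file comparison concrete, the following is a minimal sketch of a cosine similarity between two log files. The abstract does not say how log files are vectorized, so the bag-of-words term-frequency representation and the names `log_vector` and `cosine_similarity` are assumptions chosen for illustration, not the benchmark's implementation.

```python
import math
import re
from collections import Counter


def log_vector(log_text: str) -> Counter:
    """Turn a log file's text into a bag-of-words term-frequency vector (assumed representation)."""
    return Counter(re.findall(r"\w+", log_text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # An empty log file shares nothing with any other log file.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical log output from the oracle and from a predicted log statement.
oracle_log = "INFO Connection opened to host db1\nWARN Retry 1 for host db1\n"
predicted_log = "INFO Connection opened to host db1\n"
score = cosine_similarity(log_vector(oracle_log), log_vector(predicted_log))
print(f"cosine similarity: {score:.4f}")
```

Under this representation, a prediction that drops or rewrites a log line lowers the score even when the surrounding code is nearly identical, which is the kind of runtime effect that source-level similarity metrics do not see.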

Paper Structure

This paper contains 20 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: An example of logging statement generation. Logging statement generation can be separated into three parts: determining the position, selecting the level, and specifying the message.
  • Figure 2: Bad patterns in existing datasets (using instances from the LANCE dataset as examples): (1) Duplicated Variable records the same information multiple times, incurring redundant overhead for both printing and recording logs at runtime. (2) Empty String provides insufficient context, hindering the effectiveness of debugging through printed logs. (3) Unpredictable Character contains numerous meaningless special tokens, making the log statements hard to predict and their printed logs difficult to parse for downstream analysis [gojko2006logging]. (4) Wrong Verbosity Level often misleads developers during debugging and fault localization [Chen2017CharacterizingAD]. (5) Explicit Cast couples logs to variable type casting, which might cause runtime type conversion errors and system crashes [Chen2017CharacterizingAD].
  • Figure 3: $Code_{w/o\ LogStmt}$ denotes the code with one log statement removed, $LogPos$ is the position of that log statement, and $LogStmt$ is the log statement itself. Together, these three form the evaluation tuple.
  • Figure 4: The general workflow of dynamic evaluation. First, compile the project and run the unit tests to obtain ground truth logs. Then, replace log statements with predictions, re-run the tests to generate new logs, and finally analyze the results.
  • Figure 5: Performance of logging tools across different projects. The performance of each tool varies considerably from project to project, while the trends of all methods across projects generally remain consistent.
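The evaluation tuple described in the Figure 3 caption can be sketched as follows. This is an illustrative assumption of how such a tuple might be built, not the benchmark's code: `EvalTuple` and `make_eval_tuple` are hypothetical names, the position is taken to be a 0-based line index, and the log statement is assumed to occupy a single line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalTuple:
    code_without_log: str  # Code_{w/o LogStmt}: the code with one log statement removed
    log_pos: int           # LogPos: 0-based line index of the removed statement (assumed convention)
    log_stmt: str          # LogStmt: the removed log statement itself


def make_eval_tuple(source: str, log_line: int) -> EvalTuple:
    """Remove the single-line log statement at `log_line` and package the three parts."""
    lines = source.splitlines()
    stmt = lines[log_line].strip()
    remaining = lines[:log_line] + lines[log_line + 1:]
    return EvalTuple("\n".join(remaining), log_line, stmt)


# Hypothetical Java-like method containing one log statement on line index 2.
example = 'void connect() {\n    open();\n    LOG.info("Connected to {}", host);\n}'
t = make_eval_tuple(example, 2)
print(t.log_pos, t.log_stmt)
```

A tool under evaluation would receive only `code_without_log` and be scored on how well it recovers the position and the statement (level and message).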