AL-Bench: A Benchmark for Automatic Logging

Boyin Tan; Junjielong Xu; Zhouruixing Zhu; Pinjia He

TL;DR

This work introduces AL-Bench, a comprehensive benchmark for automatic logging that combines a large, real-world dataset with a dual static-dynamic evaluation framework. It shows that state-of-the-art logging tools exhibit substantial static accuracy drops and limited runtime log fidelity, with notable compilation failures, even on top-performing systems. By evaluating end-to-end tools across 10 popular projects and making data and code public, AL-Bench highlights critical gaps between reported results and real-world performance, and provides a foundation for fair, reproducible advancement in automatic logging. The benchmark emphasizes the need for context-aware generation that accounts for code execution paths and runtime log semantics to achieve practical logging quality.

Abstract

Logging, the practice of inserting log statements into source code, is critical for improving software reliability. Recently, language model-based techniques have been developed to automate log statement generation based on input code. Although these tools show promising results in prior studies, comparisons among their results are not guaranteed to be fair because they rely on ad hoc datasets. In addition, existing evaluation approaches depend exclusively on code similarity metrics and therefore fail to capture how code differences affect runtime logging behavior: even minor code modifications can make a program uncompilable or cause substantial discrepancies in log output semantics. To enhance the consistency and reproducibility of logging evaluation, we introduce AL-Bench, a comprehensive benchmark designed specifically for automatic logging tools. AL-Bench includes a large-scale, high-quality, diverse dataset collected from 10 widely recognized projects with varying logging requirements. Moreover, it introduces a novel dynamic evaluation methodology that provides a runtime perspective on logging quality in addition to the traditional static evaluation at the source code level. Specifically, AL-Bench evaluates not only the similarity between the oracle and predicted log statements in source code, but also the difference between the log files that the two log statements print at runtime. AL-Bench reveals significant limitations in existing static evaluation: all logging tools show average accuracy drops of 37.49%, 23.43%, and 15.80% in predicting log position, level, and message, respectively, compared to their reported results. Furthermore, with dynamic evaluation, AL-Bench reveals that 20.1%-83.6% of the generated log statements fail to compile. Moreover, the best-performing tool achieves only 21.32% cosine similarity between the log files of the oracle and generated log statements.
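To make the dynamic metric concrete, the following is a minimal sketch of a cosine similarity between two log files. It is not AL-Bench's implementation: the abstract does not say how log files are vectorized, so the lowercase alphanumeric tokenization, the raw term-frequency vectors, and the function names `tokenize` and `cosine_similarity` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(log_text: str) -> list[str]:
    """Split log text into lowercase alphanumeric tokens (assumed scheme)."""
    return re.findall(r"[a-z0-9]+", log_text.lower())


def cosine_similarity(oracle_log: str, predicted_log: str) -> float:
    """Cosine similarity between term-frequency vectors of two log texts.

    Returns 0.0 when either log has no tokens (an assumption of this
    sketch, e.g. for a run that printed nothing).
    """
    oracle_counts = Counter(tokenize(oracle_log))
    predicted_counts = Counter(tokenize(predicted_log))

    # Dot product over the tokens that both logs share.
    shared = oracle_counts.keys() & predicted_counts.keys()
    dot = sum(oracle_counts[t] * predicted_counts[t] for t in shared)

    # Euclidean norms of the two term-frequency vectors.
    oracle_norm = math.sqrt(sum(c * c for c in oracle_counts.values()))
    predicted_norm = math.sqrt(sum(c * c for c in predicted_counts.values()))

    if oracle_norm == 0.0 or predicted_norm == 0.0:
        return 0.0
    return dot / (oracle_norm * predicted_norm)


if __name__ == "__main__":
    # Hypothetical log lines, invented for this example.
    oracle = "INFO Connection established to host db01"
    predicted = "DEBUG Connected to database"
    print(f"similarity = {cosine_similarity(oracle, predicted):.4f}")
```

Under these assumptions, identical logs score 1.0 and logs with no shared tokens score 0.0, so a predicted log statement that changes the message wording or prints on a different execution path lowers the score even when its source code looks similar to the oracle.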
