Unlocking the Power of Numbers: Log Compression via Numeric Token Parsing

Siyu Yu, Yifan Wu, Ying Li, Pinjia He

TL;DR

Denum addresses the inefficiencies of parser-based log compressors by focusing on the numeric tokens that make up the majority of log content. It uses a two-module design: Numeric Token Parsing extracts numeric tokens and encodes them by their arithmetic relationships (for example, storing the differences between incremental timestamps), and String Processing handles the remaining content. The outputs of both modules are then fed into a general-purpose compressor. Across 16 log datasets, Denum achieves an 8.7%-434.7% higher average compression ratio (CR) and a 2.6x-37.7x faster average compression speed (CS) than the baselines. Integrated into existing log compressors, its Numeric Token Parsing module improves their average CR by 11.8% and their average CS by 37%. The components are modular and can be integrated with other tools, and public implementations are available.

Abstract

Parser-based log compressors have been widely explored in recent years because the explosive growth of log volumes makes the compression performance of general-purpose compressors unsatisfactory. These parser-based compressors preprocess logs by grouping them according to the parsing result and then feed the preprocessed files into a general-purpose compressor. However, parser-based compressors have their limitations. First, the goals of parsing and compression are misaligned, so the inherent characteristics of logs are not fully utilized. In addition, the performance of parser-based compressors depends on the sample logs and is therefore unstable. Moreover, parser-based compressors often incur a long processing time. To address these limitations, we propose Denum, a simple, general log compressor with high compression ratio and speed. The core insight is that a majority of the tokens in logs are numeric tokens (i.e., pure numbers, tokens with only numbers and special characters, and numeric variables), and compressing them effectively is critical for log compression. Specifically, Denum contains a Numeric Token Parsing module, which extracts all numeric tokens and applies tailored processing methods (e.g., storing the differences of incremental numbers such as timestamps), and a String Processing module, which processes the remaining log content without numbers. The processed files of the two modules are then fed as input to a general-purpose compressor, which outputs the final compression results. Denum has been evaluated on 16 log datasets, where it achieves an 8.7%-434.7% higher average compression ratio and a 2.6x-37.7x faster average compression speed (26.2 MB/s) compared to the baselines. Moreover, integrating Denum's Numeric Token Parsing into existing log compressors provides an 11.8% improvement in their average compression ratio and a 37% faster average compression speed.
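
The idea of storing differences of incremental numbers can be illustrated with a minimal Python sketch. This is not Denum's implementation: the function names, the synthetic timestamps, and the use of `zlib` as the stand-in general-purpose compressor are all assumptions made for illustration.

```python
import zlib


def delta_encode(values):
    """Keep the first value, then store each value's difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum, so the transform is lossless."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def compressed_size(values):
    """Bytes after writing one number per line and applying zlib (assumed back end)."""
    payload = "\n".join(map(str, values)).encode("ascii")
    return len(zlib.compress(payload, 9))


# Hypothetical timestamps: one event every 1 to 3 seconds (synthetic data).
timestamps = [1700000000]
for i in range(1, 2000):
    timestamps.append(timestamps[-1] + 1 + (i * 7) % 3)

deltas = delta_encode(timestamps)
assert delta_decode(deltas) == timestamps  # round trip restores the original

print("raw timestamps:", compressed_size(timestamps), "bytes")
print("delta-encoded: ", compressed_size(deltas), "bytes")
```

On this synthetic input the delta-encoded column compresses to fewer bytes than the raw column, because small repeating differences are easier for a dictionary-based compressor to model than long, slowly changing numbers. Real logs will behave differently, and the abstract indicates that Denum applies further tailored methods to other kinds of numeric tokens.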

Paper Structure

This paper contains 18 sections, 2 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: Increasing numbers across different templates. The increasing numbers are in red and the templates these numbers belong to are in purple.
  • Figure 2: The general steps of a parser-based log compressor
  • Figure 3: The overview of Denum
  • Figure 4: Example of logs containing various types of numeric tokens
  • Figure 5: Tagging numeric tokens
  • ...and 3 more figures