Regular Expression Indexing for Log Analysis. Extended Version

Ling Zhang, Shaleen Deep, Jignesh M. Patel, Karthikeyan Sankaralingam

TL;DR

The paper tackles the high cost of regex queries over large log datasets by introducing REI, a lightweight, $n$-gram–based bit-vector index that is stored alongside the log lines. The index works on a negative-index principle: it rules out lines that cannot match, so the regex engine runs only on the remaining candidates. REI selects a compact set of $k$ bigrams ($n=2$) from the query workload and builds a $k$-bit vector per line, achieving substantial speedups with modest space overhead. When the workload is unknown, it falls back to frequent English bigrams as index keys. The paper analyzes trade-offs across $n$-gram type, $k$, and index granularity, and reports robust performance improvements across multiple real-world datasets. In the evaluation, REI outperforms inverted-index and signature-file baselines and runs up to $14\times$ faster than a state-of-the-art regex engine without indexing, using about 2.1% extra space. The paper also provides practical guidance for parameter tuning and deployment in log-analysis pipelines. The work lays a foundation for scalable, regex-accelerated log analytics and points to future enhancements in dynamic updates, compression, and distributed implementations.
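To make the filtering idea concrete, here is a minimal Python sketch of a per-line bigram bit-vector index used as a negative filter. It is an illustration under stated assumptions, not the authors' implementation: the function names (`select_keys`, `build_index`, `query_mask`, `search`), the sample log lines, and the choice to rank bigrams by how many workload queries contain them are all hypothetical. The sketch also assumes that each query comes with a hand-supplied list of literal strings that every match must contain; extracting those literals from an arbitrary regex is not shown.

```python
import re
from collections import Counter


def bigrams(text):
    """Return the set of all 2-character substrings of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def select_keys(workload, k):
    """Pick k bigrams, ranked by the number of queries whose required
    literals contain them (assumed selection rule, ties broken by bigram)."""
    counts = Counter()
    for _pattern, literals in workload:
        query_bigrams = set()
        for lit in literals:
            query_bigrams |= bigrams(lit)
        counts.update(query_bigrams)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [bg for bg, _ in ranked[:k]]


def build_index(lines, keys):
    """One k-bit integer per line: bit i is set iff keys[i] occurs in the line."""
    index = []
    for line in lines:
        bits = 0
        for i, key in enumerate(keys):
            if key in line:
                bits |= 1 << i
        index.append(bits)
    return index


def query_mask(literals, keys):
    """Bits for the indexed bigrams that every match of the query must contain."""
    mask = 0
    for lit in literals:
        for bg in bigrams(lit):
            if bg in keys:
                mask |= 1 << keys.index(bg)
    return mask


def search(pattern, literals, lines, keys, index):
    """Skip lines missing a required indexed bigram; run the regex on the rest."""
    mask = query_mask(literals, keys)
    regex = re.compile(pattern)
    hits, regex_calls = [], 0
    for line, bits in zip(lines, index):
        if bits & mask != mask:
            continue  # a required bigram is absent, so the line cannot match
        regex_calls += 1
        if regex.search(line):
            hits.append(line)
    return hits, regex_calls


# Hypothetical sample data for illustration only.
lines = [
    "2024-01-01 ERROR disk full on /dev/sda1",
    "2024-01-01 INFO user login ok",
    "2024-01-02 ERROR timeout contacting db",
    "2024-01-02 WARN disk usage at 91%",
]
workload = [
    (r"ERROR .*disk", ["ERROR ", "disk"]),
    (r"WARN .*disk", ["WARN ", "disk"]),
]
pattern, literals = workload[0]
keys = select_keys(workload, k=4)
index = build_index(lines, keys)
hits, regex_calls = search(pattern, literals, lines, keys, index)
print(keys, hits, regex_calls)
```

The filter can only discard lines, never accept them, so any line that passes it still goes through the regex engine. This keeps the result identical to an unindexed scan while reducing the number of regex invocations.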

Abstract

In this paper, we present the design and architecture of REI, a novel system for indexing log data for regular expression queries. Our main contribution is an $n$-gram-based indexing strategy and an efficient storage mechanism that results in a speedup of up to 14x compared to state-of-the-art regex processing engines that do not use indexing, using only 2.1% of extra space. We perform a detailed study that analyzes the space usage of the index and the improvement in workload execution time, uncovering interesting insights. Specifically, we show that even an optimized implementation of strategies such as inverted indexing, which are widely used in text processing libraries, may lead to suboptimal performance for regex indexing on log analysis tasks. Overall, the REI approach presented in this paper provides a significant boost when evaluating regular expression queries on log data. REI is also modular and can work with existing regular expression packages, making it easy to deploy in a variety of settings. The code of REI is available at https://github.com/mush-zhang/REI-Regular-Expression-Indexing.

Paper Structure

This paper contains 37 sections, 1 theorem, 7 equations, 13 figures, 5 tables, 2 algorithms.

Key Result

lemma 1

The index query algorithm (algo:index_query) correctly generates the output of a regex for the given log $L$.

Figures (13)

  • Figure 1: Compared to the state-of-the-art regex matching framework BLARE, REI improves the performance by $14\times$ on a production workload, with 2.1% extra space for the index.
  • Figure 2: Comparing the sets of bigrams selected by the three methods, and the matching time when applying their resulting indices, as the number of bigrams varies.
  • Figure 3: Query overview using a bit-vector index with $k=4$.
  • Figure 4: Comparing the impact of different types of $n$-grams on index construction time and matching time on indexes with the top 64 most frequent $n$-grams in workload queries.
  • Figure 5: Comparing the impact of different numbers of $n$-grams on index construction time, using the top-$k$ most frequent $n$-grams in workload queries.
  • ...and 8 more figures

Theorems & Definitions (1)

  • lemma 1