Regular Expression Indexing for Log Analysis. Extended Version
Ling Zhang, Shaleen Deep, Jignesh M. Patel, Karthikeyan Sankaralingam
TL;DR
The paper tackles the high cost of regex queries over large log datasets by introducing REI, a lightweight, $n$-gram-based bit-vector index that is stored alongside the log lines and used, via a negative-index principle, to filter out lines before the regex engine is invoked. REI selects a compact set of $k$ bigrams ($n=2$) from the query workload and builds a $k$-bit vector for each line. When the workload is unknown, it falls back to frequent English bigrams as index keys. The paper analyzes the trade-offs across $n$-gram type, $k$, and index granularity, and evaluates REI on multiple real-world datasets. The evaluation reports a speedup of up to $14\times$ over regex engines that do not use indexing, at about 2.1% extra space, and shows REI outperforming inverted-index and signature-file baselines; it also gives practical guidance for parameter tuning and deployment in log-analysis pipelines. The work lays a foundation for scalable, regex-accelerated log analytics and points to future enhancements in dynamic updates, compression, and distributed implementations.
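The filtering idea summarized above can be illustrated with a minimal Python sketch. It is not the REI implementation: the function names (`build_index`, `required_mask`, `search`), the sample log lines, and the chosen bigrams are all hypothetical. It also assumes the caller supplies a literal substring that every match of the regex must contain, whereas a real system would have to derive such literals from the regex itself. The negative-index principle is read here as "skip a line when its bit vector shows that a required bigram is absent".

```python
import re

def build_index(lines, bigrams):
    """Return one k-bit integer per line; bit i is set iff bigrams[i] occurs in the line."""
    index = []
    for line in lines:
        bits = 0
        for i, gram in enumerate(bigrams):
            if gram in line:
                bits |= 1 << i
        index.append(bits)
    return index

def required_mask(literal, bigrams):
    """Bits of the indexed bigrams contained in a literal that every match must include."""
    mask = 0
    for i, gram in enumerate(bigrams):
        if gram in literal:
            mask |= 1 << i
    return mask

def search(lines, index, bigrams, pattern, literal):
    """Run the regex only on lines whose bit vector contains all required bigrams."""
    mask = required_mask(literal, bigrams)
    regex = re.compile(pattern)
    hits = []
    for line, bits in zip(lines, index):
        if (bits & mask) != mask:
            continue  # a required bigram is missing, so the regex cannot match this line
        if regex.search(line):
            hits.append(line)
    return hits

# Hypothetical data: three log lines and k = 3 indexed bigrams.
lines = [
    "ERROR disk full on node7",
    "INFO job 42 finished",
    "ERROR timeout after 30s",
]
bigrams = ["ER", "di", "ti"]
index = build_index(lines, bigrams)

# Assumed: every match of this pattern contains the literal "ERROR disk ".
pattern = r"ERROR disk \w+"
hits = search(lines, index, bigrams, pattern, "ERROR disk ")
print(hits)
```

In this sketch the filter can only discard lines, never accept them: a line that passes the bit test is still checked by the regex engine, so the result is the same as a full scan as long as the supplied literal really is required by the pattern.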
Abstract
In this paper, we present the design and architecture of REI, a novel system for indexing log data for regular expression queries. Our main contribution is an $n$-gram-based indexing strategy and an efficient storage mechanism that results in a speedup of up to 14x compared to state-of-the-art regex processing engines that do not use indexing, using only 2.1% of extra space. We perform a detailed study that analyzes the space usage of the index and the improvement in workload execution time, uncovering interesting insights. Specifically, we show that even an optimized implementation of strategies such as inverted indexing, which are widely used in text processing libraries, may lead to suboptimal performance for regex indexing on log analysis tasks. Overall, the REI approach presented in this paper provides a significant boost when evaluating regular expression queries on log data. REI is also modular and can work with existing regular expression packages, making it easy to deploy in a variety of settings. The code of REI is available at https://github.com/mush-zhang/REI-Regular-Expression-Indexing.
