Computing the LZ-End parsing: Easy to implement and practically efficient
Patrick Dinklage
TL;DR
This work targets the practical computation of the LZ-End parsing, balancing compression quality, random access, and ease of implementation. It refines the $\mathcal{O}(n \lg\lg n)$-time parsing algorithm of Kempa & Kosolobov with lazy evaluation and an associative, compact index that removes the dependence on the suffix array after initialization, which reduces working memory. The result is a simple, complete implementation with a full listing that compares favorably in parsing speed against the state of the art on the Pizza&Chili corpus, with noted tradeoffs for highly repetitive inputs. Overall, the method provides a near-linear-time, memory-efficient LZ-End parser suited to streaming and indexing tasks on real-world data.
Abstract
The LZ-End parsing [Kreft & Navarro, 2011] of an input string yields compression competitive with the popular Lempel-Ziv 77 scheme, but also allows for efficient random access. Kempa and Kosolobov showed that the parsing can be computed in time and space linear in the input length [Kempa & Kosolobov, 2017]; however, the corresponding algorithm is hardly practical. We put the spotlight on their suboptimal algorithm that computes the parsing in time $\mathcal{O}(n \lg\lg n)$. It requires a comparatively small toolset and is therefore easy to implement, yet it is very efficient in practice. We give a detailed and simplified description with a full listing that incorporates undocumented tricks from the original implementation, uses lazy evaluation to reduce the workload in practice, and requires less working memory by removing a level of indirection. We legitimize our algorithm in a brief benchmark, in which it computes the parsing faster than the state of the art.
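To make the object being computed concrete, the following is a minimal sketch of the LZ-End parsing as defined by Kreft and Navarro: each phrase copies the longest prefix of the remaining input that is also a suffix of the text ending at some earlier phrase boundary, and then appends one literal character. This is a naive, definition-driven reference and not the $\mathcal{O}(n \lg\lg n)$ algorithm discussed here. The function names, the `(source_phrase, copy_length, literal)` triple, the use of `-1` for "no source", the tie-breaking among equally long sources, and the convention of always reserving the last character of the input as a literal are all assumptions made for illustration.

```python
from __future__ import annotations


def lz_end_naive(text: str) -> list[tuple[int, int, str]]:
    """Naive LZ-End parser (illustrative only, far slower than near-linear).

    Returns phrases as (source_phrase, copy_length, literal), where the copied
    part is the copy_length characters ending at the end of phrase
    source_phrase. source_phrase is -1 when nothing is copied.
    """
    phrases: list[tuple[int, int, str]] = []
    ends: list[int] = []  # ends[j] = exclusive end position of phrase j
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        max_len = n - i - 1  # assumption: always keep one literal character
        for j, e in enumerate(ends):
            # Matching lengths are not monotone, so try every candidate length
            # that would improve on the best one found so far.
            for length in range(min(e, max_len), best_len, -1):
                if text[e - length:e] == text[i:i + length]:
                    best_len, best_src = length, j
                    break
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
        ends.append(i)
    return phrases


def lz_end_decode(phrases: list[tuple[int, int, str]]) -> str:
    """Invert lz_end_naive by replaying the copies and literals."""
    out: list[str] = []
    ends: list[int] = []
    for src, length, literal in phrases:
        if length > 0:
            e = ends[src]
            # The source ends at an earlier phrase boundary, so it is
            # already fully decoded and never overlaps the new phrase.
            out.extend(out[e - length:e])
        out.append(literal)
        ends.append(len(out))
    return "".join(out)
```

Because every source ends at a phrase boundary, extracting a substring only requires following phrase sources backwards from known boundaries, which is the structural property behind the random access mentioned above. For example, under the stated conventions `lz_end_naive("abababab")` yields four phrases: the literals `a` and `b`, then `ab` copied from the end of the second phrase plus `a`, then `ba` copied from the end of the third phrase plus `b`.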
