LZBE: an LZ-style compressor supporting $O(\log n)$-time random access
Hiroki Shibata, Yuto Nakashima, Yutaro Yamaguchi, Shunsuke Inenaga
TL;DR
This work introduces LZ-Begin-End (LZBE), a restricted LZ-like factorization in which each copy factor refers to a contiguous block of preceding factors, enabling efficient random access. The paper proves that any context-free grammar generating a single string can be transformed into an LZBE factorization of the same size, and shows that the greedy LZBE factorization can be asymptotically smaller than the smallest grammar for some families of strings, establishing a separation between LZBE and grammar-based compression. The authors present a linear-space data structure combining interval-biased search trees and symmetric centroid path decomposition to achieve $O(\log n)$-time random access for LZBE-compressed strings, and a linear-time algorithm to compute the greedy LZBE factorization using a suffix-tree-based framework with advanced ancestorship structures. Together, these results position LZBE as a query-friendly compression scheme that is at least as compact as grammar-based compression while still supporting fast access.
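To make the restriction concrete, the following minimal sketch decodes a factorization in which every copy factor names a range of earlier factors. The representation is an assumption made for illustration, not the paper's notation: a factor is either a one-character string or a pair `(i, j)` of 0-based indices meaning "copy factors `i` through `j` inclusive", and the name `decode_lzbe` is hypothetical.

```python
from typing import List, Tuple, Union

# Assumed representation: a literal character, or (i, j) meaning
# "copy the concatenation of factors i..j" (0-based, inclusive).
Factor = Union[str, Tuple[int, int]]


def decode_lzbe(factors: List[Factor]) -> str:
    """Expand a factor list into the string it represents."""
    pieces: List[str] = []  # pieces[k] is the expansion of factor k
    for k, f in enumerate(factors):
        if isinstance(f, str):
            if len(f) != 1:
                raise ValueError(f"factor {k}: literal must be one character")
            pieces.append(f)
        else:
            i, j = f
            # The source must be a contiguous run of strictly earlier factors.
            if not (0 <= i <= j < k):
                raise ValueError(f"factor {k}: invalid source range {f}")
            pieces.append("".join(pieces[i:j + 1]))
    return "".join(pieces)


print(decode_lzbe(["a", "b", (0, 1), (0, 2)]))  # prints "abababab"
```

Here four factors expand to a string of length 8: the third factor copies factors 0 and 1 (`ab`), and the fourth copies factors 0 through 2 (`abab`). The sketch materializes every expansion, so it is only suitable for small inputs.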
Abstract
An LZ-like factorization of a string divides it into factors, each being either a single character or a copy of a preceding substring. While grammar-based compression schemes support efficient random access with space linear in the compressed size, no comparable guarantees are known for general LZ-like factorizations. This limitation motivated restricted variants such as LZ-End [Kreft and Navarro, 2013] and height-bounded LZ (LZHB) [Bannai et al., 2024], which trade off some compression efficiency for faster access. In this paper, we introduce LZ-Begin-End (LZBE), a new LZ-like variant in which every copy factor must refer to a contiguous sequence of preceding factors. This structural restriction ensures that any context-free grammar can be transformed into an LZBE factorization of the same size. We further study the greedy LZBE factorization, which selects each copy factor to be as long as possible while processing the input from left to right, and show that it can be computed in linear time. Moreover, we exhibit a family of strings for which the greedy LZBE factorization is asymptotically smaller than the smallest grammar. These results demonstrate that the LZBE scheme is strictly more expressive than grammar-based compression in the worst case. To support fast queries, we propose a data structure for LZBE-compressed strings that permits $O(\log n)$-time random access within space linear in the compressed size, where $n$ is the length of the input string.
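As a rough illustration of the greedy rule and of access without decompression, here is a naive sketch. It is not the linear-time suffix-tree algorithm or the $O(\log n)$ data structure of the paper: the factorization tries every range of earlier factors, and `access` simply follows copy sources back to a literal, so its cost depends on how deeply copies are nested. The function names, the `(i, j)` factor representation (0-based, inclusive), and the choice to emit a literal only when no copy matches are all assumptions made for this example.

```python
from bisect import bisect_right


def greedy_lzbe(text):
    """Naive greedy factorization: at each position take the longest
    concatenation of consecutive earlier factors that matches."""
    factors = []
    starts = [0]  # starts[k] = start of factor k; last entry = current position
    p = 0
    while p < len(text):
        best, best_len = None, 0
        k = len(factors)
        for i in range(k):
            for j in range(i, k):
                length = starts[j + 1] - starts[i]
                source = text[starts[i]:starts[j + 1]]
                if length > best_len and text.startswith(source, p):
                    best, best_len = (i, j), length
        if best is None:
            factors.append(text[p])  # no earlier factor run matches: literal
            p += 1
        else:
            factors.append(best)
            p += best_len
        starts.append(p)
    return factors


def factor_starts(factors):
    """Start position of every factor, computed without expanding the text."""
    starts = [0]
    for f in factors:
        length = 1 if isinstance(f, str) else starts[f[1] + 1] - starts[f[0]]
        starts.append(starts[-1] + length)
    return starts


def access(factors, starts, pos):
    """Return the character at position pos by chasing copy sources."""
    if not 0 <= pos < starts[-1]:
        raise IndexError(pos)
    while True:
        k = bisect_right(starts, pos) - 1  # factor containing pos
        f = factors[k]
        if isinstance(f, str):
            return f
        # Jump to the same offset inside the source, which lies strictly earlier.
        pos = starts[f[0]] + (pos - starts[k])


factors = greedy_lzbe("abababab")
print(factors)  # ['a', 'b', (0, 1), (0, 2)]
starts = factor_starts(factors)
print("".join(access(factors, starts, p) for p in range(starts[-1])))  # abababab
```

Because every source begins at a factor boundary, the start of the source is read directly from `starts`, and each jump moves to a strictly smaller position, which guarantees termination.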
