LZBE: an LZ-style compressor supporting $O(\log n)$-time random access

Hiroki Shibata, Yuto Nakashima, Yutaro Yamaguchi, Shunsuke Inenaga

TL;DR

This work introduces LZ-Begin-End (LZBE), a restricted LZ-like factorization where each copy factor refers to a contiguous block of preceding factors, enabling efficient random access. It proves that any context-free grammar can be transformed into an equivalent LZBE factorization of the same size and shows that the greedy LZBE factorization can be asymptotically smaller than the smallest grammar in some cases, establishing a separation between LZBE and grammar-based compression. The authors present a linear-space data structure combining interval-biased search trees and symmetric centroid path decomposition to achieve $O(\log n)$-time random access for LZBE-compressed strings, and a linear-time algorithm to compute the greedy LZBE factorization using a suffix-tree-based framework with advanced ancestorship structures. Together, these results position LZBE as a powerful, query-friendly compression scheme that balances strong representation with practical access efficiency.

Abstract

An LZ-like factorization of a string divides it into factors, each being either a single character or a copy of a preceding substring. While grammar-based compression schemes support efficient random access with space linear in the compressed size, no comparable guarantees are known for general LZ-like factorizations. This limitation motivated restricted variants such as LZ-End [Kreft and Navarro, 2013] and height-bounded LZ (LZHB) [Bannai et al., 2024], which trade off some compression efficiency for faster access. In this paper, we introduce LZ-Begin-End (LZBE), a new LZ-like variant in which every copy factor must refer to a contiguous sequence of preceding factors. This structural restriction ensures that any context-free grammar can be transformed into an LZBE factorization of the same size. We further study the greedy LZBE factorization, which selects each copy factor to be as long as possible while processing the input from left to right, and show that it can be computed in linear time. Moreover, we exhibit a family of strings for which the greedy LZBE factorization is asymptotically smaller than the smallest grammar. These results demonstrate that the LZBE scheme is strictly more expressive than grammar-based compression in the worst case. To support fast queries, we propose a data structure for LZBE-compressed strings that permits $O(\log n)$-time random access within space linear in the compressed size, where $n$ is the length of the input string.
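To make the definition concrete, the following Python sketch shows one possible encoding of an LZBE factorization. It is not the authors' implementation. It assumes a factor is either a single character or a pair `(b, e)` of 0-based indices meaning "copy the concatenation of earlier factors `b` through `e`". The names `Factor`, `decode`, and `greedy_lzbe` are hypothetical. The greedy routine is a brute-force search over all blocks of earlier factors, not the linear-time suffix-tree algorithm mentioned in the abstract. Breaking ties by taking the first longest match, and encoding a repeated single character as a copy factor, are also assumptions.

```python
from typing import List, Tuple, Union

# A factor is a single character, or (b, e): copy earlier factors b..e inclusive.
Factor = Union[str, Tuple[int, int]]


def decode(factors: List[Factor]) -> str:
    """Expand an LZBE factorization back into the text it represents."""
    text = ""
    ends: List[int] = []  # ends[k] = text position just past factor k
    for k, f in enumerate(factors):
        if isinstance(f, str):
            text += f
        else:
            b, e = f
            # The source must be a block of factors strictly before factor k.
            if not 0 <= b <= e < k:
                raise ValueError(f"factor {k} has an invalid source {f}")
            start = ends[b - 1] if b > 0 else 0
            text += text[start:ends[e]]
        ends.append(len(text))
    return text


def greedy_lzbe(text: str) -> List[Factor]:
    """Brute-force greedy factorization: always take the longest valid copy."""
    factors: List[Factor] = []
    bounds = [0]  # bounds[k] = start of factor k; bounds[-1] = current position
    i = 0
    while i < len(text):
        best, best_len = None, 0
        for b in range(len(factors)):
            for e in range(b, len(factors)):
                length = bounds[e + 1] - bounds[b]
                source = text[bounds[b]:bounds[e + 1]]
                if length > best_len and text.startswith(source, i):
                    best, best_len = (b, e), length
        if best is None:
            factors.append(text[i])  # no earlier block matches: emit a character
            i += 1
        else:
            factors.append(best)
            i += best_len
        bounds.append(i)
    return factors


if __name__ == "__main__":
    s = "ababbababab"
    fs = greedy_lzbe(s)
    print(fs)
    print(decode(fs) == s)
```

The sketch only illustrates the structural restriction: every copy source starts at the beginning of one earlier factor and stops at the end of another. It makes no attempt at the running time or the random-access bounds stated above.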

Paper Structure

This paper contains 14 sections, 16 theorems, 12 equations, 6 figures, 1 algorithm.

Key Result

Theorem 1

Given a CFG of size $g$ representing string $T$, we can construct an LZBE factorization with at most $g$ factors.
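One plausible reading of this construction, suggested by the pruned derivation trees in the figure captions, can be sketched in Python as follows. The sketch assumes the grammar generates a single string, has no empty right-hand sides, and that its size $g$ is the total length of all right-hand sides. The function name `grammar_to_lzbe`, the dictionary encoding of rules, and the example grammar are hypothetical and not taken from the paper. The first occurrence of each nonterminal is expanded. Every later occurrence becomes one copy factor that points to the block of factors produced by the first occurrence.

```python
from typing import Dict, List, Tuple, Union

# A factor is a single character, or (b, e): copy earlier factors b..e inclusive.
Factor = Union[str, Tuple[int, int]]


def grammar_to_lzbe(rules: Dict[str, List[str]], start: str) -> List[Factor]:
    """Walk the derivation tree left to right, pruning repeated nonterminals."""
    factors: List[Factor] = []
    span: Dict[str, Tuple[int, int]] = {}  # nonterminal -> factors of its first expansion

    def visit(sym: str) -> None:
        if sym not in rules:
            factors.append(sym)        # terminal: a single-character factor
        elif sym in span:
            factors.append(span[sym])  # repeated nonterminal: one copy factor
        else:
            first = len(factors)
            for child in rules[sym]:
                visit(child)
            span[sym] = (first, len(factors) - 1)

    visit(start)
    return factors


if __name__ == "__main__":
    # Hypothetical grammar for "abaabaabaab": total right-hand-side length is 8.
    rules = {"S": ["X", "X", "X", "Y"], "X": ["Y", "a"], "Y": ["a", "b"]}
    g = sum(len(rhs) for rhs in rules.values())
    factors = grammar_to_lzbe(rules, "S")
    print(factors, len(factors) <= g)
```

Under these assumptions the pruned tree has $g + 1$ nodes, and each nonterminal accounts for exactly one internal node, so the number of leaves (and hence factors) is $g + 1$ minus the number of nonterminals, which is at most $g$.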

Figures (6)

  • Figure 1: The greedy LZBE factorization of $ababbababab$ and the pruned derivation tree of the corresponding CFG. Dotted lines and arrows indicate the sources of copy factors.
  • Figure 2: The derivation tree of a grammar generating $abaabaabaab$ (left) and its pruned derivation tree (right). The LZBE factorization corresponding to the grammar is shown below the pruned tree.
  • Figure 3: Examples of bottom-up computation of ${{\rm val}({{{\rm head}({X})}})}$ and ${{\rm val}({{{\rm tail}({X})}})}$. In the left example, the value $q_2$ is computed as $q_2 = {{\rm val}({Q_2})} = {{\rm val}({{{\rm tail}({Y})} \cdot {{\rm head}({Z})}})} = {{\rm val}({{{\rm tail}({Y})}})} \otimes {{\rm val}({{{\rm head}({Z})}})}$.
  • Figure 4: The subtree of the interval-biased search tree rooted at the node $v_c = {{\rm lca}({v_i, v_{j-1}})}$, used to locate the interval containing a query value $q \in [a_i, a_j)$. The hint nodes $v_c$, $v_l = {{\rm lca}({v_i, v_{c-1}})}$, and $v_r = {{\rm lca}({v_{c+1}, v_{j-1}})}$ allow us to restrict the search to subtrees of limited range.
  • Figure 5: An illustration of a heavy path in the dependency DAG. The path consists of five nodes $F_{i_1}$ through $F_{i_5}$, with solid arrows for heavy edges and dotted arrows for light edges. Factors not belonging to this heavy path are shown in gray. The jump sequence from $(F_{i_1}, r_1)$, where $r_1 \in [R_5, R_2)$, has exit interval $I_4^R$ and exit position $(F_{i_4}, r_4)$, with $r_4 = R_5 - r_1$.
  • ...and 1 more figure

Theorems & Definitions (28)

  • Theorem 1
  • proof
  • Theorem 2
  • Definition 3 [DBLP:journals/ijcga/ChazelleR91]
  • Theorem 4 [DBLP:journals/ijcga/ChazelleR91]
  • Lemma 5
  • proof
  • proof: Proof of Theorem \ref{thm:LZBE_grammar_differ}
  • Theorem 6 [DBLP:journals/corr/abs-2406-06321]
  • Corollary 7
  • ...and 18 more