The Complexity of Dynamic LZ77 is $\tilde{\Theta}(n^{2/3})$

Itai Boneh, Shay Golan, Matan Kraus

TL;DR

This work establishes the first fully dynamic maintenance framework for the LZ77 factorization with near-linear preprocessing and sublinear update time. The authors introduce an LPF-tree representation and a heavy-index scheme, implemented via top trees, to support SelectPhrase, ContainingPhrase, and LZLength queries in polylogarithmic time with $\tilde{O}(n^{2/3})$ time per update. They show this bound is tight by proving a matching lower bound under the Strong Exponential Time Hypothesis, via a reduction from Orthogonal Vectors to dynamic LZ77 maintenance. The results place LZ77 in an unusual position among string problems: it can be solved in near-linear time in the static setting, yet its dynamic update complexity is polynomial but sublinear, $\tilde{\Theta}(n^{2/3})$. They also open avenues for research on space-efficient structures and on other compression variants in dynamic settings.

Abstract

The Lempel-Ziv 77 (LZ77) factorization is a fundamental compression scheme widely used in text processing and data compression. In this work, we investigate the time complexity of maintaining the LZ77 factorization of a dynamic string. By establishing matching upper and lower bounds, we fully characterize the complexity of this problem. We present an algorithm that efficiently maintains the LZ77 factorization of a string $S$ undergoing edit operations, including character substitutions, insertions, and deletions. Our data structure can be constructed in $\tilde{O}(n)$ time for an initial string of length $n$ and supports updates in $\tilde{O}(n^{2/3})$ time, where $n$ is the current length of $S$. Additionally, we prove that no algorithm can achieve an update time of $O(n^{2/3-\varepsilon})$ unless the Strong Exponential Time Hypothesis fails. This lower bound holds even in the restricted setting where only substitutions are allowed and only the length of the LZ77 factorization is maintained.
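For readers unfamiliar with the object being maintained, the following is a minimal quadratic-time sketch of one common LZ77 variant. It is an assumption that this matches the paper's exact definition: here each phrase is the longest prefix of the remaining suffix that also occurs starting at an earlier position (overlap allowed), or a single fresh character if there is no such occurrence. The function name `lz77_factorize` is hypothetical and this is not the authors' algorithm.

```python
def lz77_factorize(s):
    """Return the LZ77 phrases of s as (start, length) pairs.

    Assumed variant: a phrase starting at i is the longest prefix of s[i:]
    that also starts at some j < i (self-overlap allowed), or one fresh
    character when no earlier occurrence exists. Naive O(n^2)-or-worse scan.
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best = 0
        for j in range(i):  # try every earlier starting position
            k = 0
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            best = max(best, k)
        length = max(best, 1)  # a fresh character forms a phrase of length 1
        phrases.append((i, length))
        i += length
    return phrases


if __name__ == "__main__":
    print(lz77_factorize("abababc"))  # [(0, 1), (1, 1), (2, 4), (6, 1)]
```

Under this variant, a single substitution can change many phrase boundaries to its right, which is why maintaining the factorization under edits is nontrivial.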

Paper Structure

This paper contains 19 sections, 28 theorems, 18 equations, 6 figures.

Key Result

Theorem 1

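There is a data structure solving the dynamic LZ77 maintenance problem (supporting SelectPhrase, ContainingPhrase, and LZLength queries on a string $S$ undergoing edit operations) with $\tilde{O}(n)$ preprocessing time, $\tilde{O}(1)$ time per query, and $\tilde{O}(n^{2/3})$ time per update, where $n$ is the current length of $S$.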
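To make the interface in Theorem 1 concrete, here is a naive baseline that rebuilds the factorization from scratch after every edit. It only illustrates what the three queries named in the TL;DR might return; it does not achieve the stated bounds and is not the paper's data structure. The class name, method names, 0-based indexing, return shapes, and the LZ77 variant (longest previous factor, or a single fresh character) are all assumptions.

```python
from bisect import bisect_right


class NaiveDynamicLZ77:
    """Rebuild-on-update baseline (hypothetical interface, not the paper's)."""

    def __init__(self, s):
        self.s = list(s)
        self._rebuild()

    def _rebuild(self):
        # Recompute the start position of every phrase by brute force.
        s, n = self.s, len(self.s)
        self.starts = []
        i = 0
        while i < n:
            best = 0
            for j in range(i):
                k = 0
                while i + k < n and s[j + k] == s[i + k]:
                    k += 1
                best = max(best, k)
            self.starts.append(i)
            i += max(best, 1)

    # Updates: each one triggers a full rebuild.
    def substitute(self, i, c):
        self.s[i] = c
        self._rebuild()

    def insert(self, i, c):
        self.s.insert(i, c)
        self._rebuild()

    def delete(self, i):
        del self.s[i]
        self._rebuild()

    # Queries.
    def lz_length(self):
        """Number of phrases in the current factorization."""
        return len(self.starts)

    def select_phrase(self, k):
        """(start, length) of the k-th phrase, 0-based."""
        start = self.starts[k]
        end = self.starts[k + 1] if k + 1 < len(self.starts) else len(self.s)
        return (start, end - start)

    def containing_phrase(self, pos):
        """0-based index of the phrase covering position pos."""
        return bisect_right(self.starts, pos) - 1
```

A short usage example: for `d = NaiveDynamicLZ77("abababc")`, `d.lz_length()` is 4 and `d.containing_phrase(5)` is 2; after `d.substitute(3, "c")` the string becomes `abacabc` and `d.lz_length()` is 6, showing how one substitution can reshape the factorization.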

Figures (6)

  • Figure 1: Example of an $L$-heavy index. Notice that regardless of the update at position $z$, the equality between $S[i..z-1]$ and $S[\mathsf{LPFpos}_S(i)..\mathsf{LPFpos}_S(i) + z-i-1]$ still persists in $S'$. It follows that $M_L$ is contained in $S'[i..i+\mathsf{LPF}_{S'}(i)]$ as well.
  • Figure 2: Example of a light index.
  • Figure 3: An illustration of the case where $z\in [\mathsf{LPFpos}_S(i) .. \mathsf{LPFpos}_S(i) + \mathsf{LPF}_S(i))$ and $i$ is an $L$-heavy index.
  • Figure 4: Example of an $R$-heavy index. Notice that, unlike for an $L$-heavy index, $\mathsf{LPFpos}$ is not the reason that $S'[i..i+\mathsf{LPF}_{S'}(i))$ contains an occurrence of $M_R$, as the equality between $i$ and $\mathsf{LPFpos}_{S}(i)$ in $S'$ 'breaks' at $z$, right before the start of $M_R$.
  • Figure 5: An example of a light index arising from $\mathsf{LPF}_{S'}(i) = a+b+1$. Here, since the new $\mathsf{LPF}$ value is small, it must hold that $i$ is the first occurrence after $z$ of $S[z-a..z+m]$.
  • ...and 1 more figure
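The captions above rely on the $\mathsf{LPF}$ and $\mathsf{LPFpos}$ notation. As a reading aid, the sketch below computes both arrays by brute force under the standard "longest previous factor" definition, which is assumed here to be what the paper uses: $\mathsf{LPF}_S(i)$ is the length of the longest prefix of $S[i..]$ that also starts at some $j < i$, and $\mathsf{LPFpos}_S(i)$ is one such $j$. The function name, the 0-based indexing, and the leftmost tie-break are assumptions for illustration.

```python
def lpf_arrays(s):
    """Return (lpf, lpfpos) for s by brute force.

    lpf[i]    : length of the longest prefix of s[i:] that also starts at
                some j < i (overlap with position i is allowed).
    lpfpos[i] : the leftmost such j, or None when lpf[i] == 0.
    """
    n = len(s)
    lpf = [0] * n
    lpfpos = [None] * n
    for i in range(n):
        for j in range(i):
            k = 0
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            if k > lpf[i]:  # strict '>' keeps the leftmost source on ties
                lpf[i], lpfpos[i] = k, j
    return lpf, lpfpos


if __name__ == "__main__":
    lpf, lpfpos = lpf_arrays("abababc")
    print(lpf)     # [0, 0, 4, 3, 2, 1, 0]
    print(lpfpos)  # [None, None, 0, 1, 0, 1, None]
```

Comparing the arrays before and after changing a single character shows which indices keep their values and which do not, the distinction the heavy and light index figures are about.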

Theorems & Definitions (31)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Definition 4: L-Heavy index
  • Lemma 10: kempa2022dynamic
  • Lemma 11
  • Lemma 11
  • Lemma 12
  • Lemma 13: BCR24
  • Lemma 14
  • ...and 21 more