Computing the LZ-End parsing: Easy to implement and practically efficient

Patrick Dinklage

TL;DR

This work targets practical computation of the LZ-End parsing, balancing compression quality with random access and implementability. It refines the $O(n\log\log n)$ parsing algorithm of Kempa & Kosolobov by introducing lazy evaluation and an associative, compact index that removes dependence on the suffix array after initialization, broadening practicality and reducing memory. The approach yields a simple, complete implementation with a full listing and demonstrates favorable parsing speed against the state of the art on the Pizza&Chili corpus, while noting tradeoffs for highly repetitive inputs. Overall, the method provides a near-linear, memory-efficient parser for LZ-End that is well-suited for streaming and indexing tasks in real-world data.

Abstract

The LZ-End parsing [Kreft & Navarro, 2011] of an input string yields compression competitive with the popular Lempel-Ziv 77 scheme, but also allows for efficient random access. Kempa and Kosolobov showed that the parsing can be computed in time and space linear in the input length [Kempa & Kosolobov, 2017]; however, the corresponding algorithm is hardly practical. We put the spotlight on their suboptimal algorithm that computes the parsing in time $\mathcal{O}(n \lg\lg n)$. It requires a comparatively small toolset and is therefore easy to implement, yet it is very efficient in practice. We give a detailed and simplified description with a full listing that incorporates undocumented tricks from the original implementation, uses lazy evaluation to reduce the workload in practice, and requires less working memory by removing a level of indirection. We validate our algorithm in a brief benchmark, in which it computes the parsing faster than the state of the art.

Paper Structure (12 sections, 1 theorem, 2 equations, 3 figures, 1 table, 1 algorithm)

Introduction
Preliminaries
The LZ-End Parsing
Index Data Structures
Dynamic Associative Predecessor and Successor Data Structures
Computing the LZ-End Parsing
Finding Source Phrases
The Algorithm
Analysis
Preparing Efficient Random Access
Implementation
Handling Highly Repetitive Inputs

Key Result

Lemma 1

If $f_1 \cdots f_z$ is the LZ-End parsing of a string $S \in \Sigma^*$, then, for any character $\alpha \in \Sigma$, the last phrase in the LZ-End parsing of $S\alpha$ is (1) $f_{z-1} f_z \alpha$ or (2) $f_z \alpha$ or (3) $\alpha$.
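
The case distinction of Lemma 1 can be checked by brute force on small inputs. The following is a minimal sketch, not the paper's algorithm: it assumes the definition in which each phrase is the longest prefix of the remaining text whose all-but-last characters form a suffix of $f_1 \cdots f_j$ for some earlier phrase $f_j$. The function names `lz_end_naive` and `last_phrase_case` are hypothetical and chosen for illustration only.

```python
def lz_end_naive(s):
    """Naive LZ-End parsing of s (assumed definition, polynomial time)."""
    phrases = []
    ends = []  # exclusive end positions of the phrases found so far
    i = 0
    while i < len(s):
        best = 1  # a single literal character is always a valid phrase
        # Validity is not monotone in the length, so every length is tried.
        for length in range(2, len(s) - i + 1):
            copied = s[i:i + length - 1]
            k = len(copied)
            # The copied part must end exactly at an earlier phrase boundary.
            if any(e >= k and s[e - k:e] == copied for e in ends):
                best = length
        phrases.append(s[i:i + best])
        i += best
        ends.append(i)
    return phrases


def last_phrase_case(s, alpha):
    """Return which case of Lemma 1 (1, 2 or 3) applies when appending alpha to s."""
    old = lz_end_naive(s)
    new_last = lz_end_naive(s + alpha)[-1]
    if len(old) >= 2 and new_last == old[-2] + old[-1] + alpha:
        return 1  # merge the last two phrases and append alpha
    if len(old) >= 1 and new_last == old[-1] + alpha:
        return 2  # extend the last phrase by alpha
    if new_last == alpha:
        return 3  # alpha forms a new phrase of its own
    return None   # would contradict the lemma


if __name__ == "__main__":
    text = "abaabaa$"
    print(lz_end_naive(text))
    print([last_phrase_case(text[:k], text[k]) for k in range(len(text))])
```

On the string from Figure 1, the sketch reproduces the phrases `a`, `b`, `aa`, `baa$`, and every appended character falls into one of the three cases. Re-parsing each prefix from scratch is only meant to make the lemma tangible; the point of the lemma is that an efficient algorithm needs to inspect only the last two phrases per step.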

Figures (3)

Figure 1: Character-wise LZ-End parsing of the string $S = \texttt{a}\,\texttt{b}\,\texttt{aa}\,\texttt{baa\$}$, applying the case distinction of Lemma 1 in every step. Refer to the corresponding example in the paper for a description of the individual steps.
Figure 2: Search for copy source phrase candidates in suffix array space (FindCopySource). Positions that mark the ending locations in $M$ of already computed LZ-End phrases are indicated by the circles. Note that while $i'$ is the ending location of the most recent phrase $f_z$, it has not yet been entered into $M$. We do a predecessor query starting from $i'-1$ (LexSmallerPhrase) or a successor query starting from $i'+1$ (LexGreaterPhrase), giving us the locations $j'_L$ or $j'_R$ that mark the candidates $p_L$ and $p_R$, respectively. In case $p_L = z-1$ or $p_R = z-1$, we do another predecessor or successor query starting from $j'_L-1$ or $j'_R+1$, respectively, to find a candidate for merging. Using range minimum queries in the LCP array, we can find the longest common extension between the suffix of $\overleftarrow{S}$ starting at $A[i']$ and those starting at $A[j'_L]$ or $A[j'_R]$, respectively, which is the number of characters that can be copied from the corresponding source phrase.
Figure 3: The relevant data structures of Algorithm 1 right before merging $f_5$ and $f_4$ in the final step of parsing $S = \texttt{a}\,\texttt{b}\,\texttt{aa}\,\texttt{ba}\,\texttt{abaa\$}$. $A$ and $H$ are the suffix and LCP array of $S$, respectively, and we associate a position $M[i]$ with a phrase iff that phrase ends at position $A[i]$ in $\overleftarrow{S}$. Note that $\overleftarrow{S}$ and $A$ are not actually stored but only shown for reference. The phrase $f_3$ is found as a suitable source phrase, allowing to copy up to $\text{rmq}_H(3,3)=4$ characters. As a result of the merge, the former boundary of phrase $f_4$ is removed from $M$. Refer to the corresponding example in the paper for an elaboration.
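
The search described in the caption of Figure 2 combines two ingredients: predecessor and successor queries over marked positions in suffix array space, and range minimum queries on the LCP array to obtain a longest common extension. The sketch below illustrates both under simplifying assumptions: the suffix and LCP arrays are built naively, a sorted Python list with `bisect` stands in for the paper's dynamic predecessor/successor structure, and the range minimum is a plain `min` over a slice. The helper names (`suffix_array`, `lcp_array`, `neighbours`, `lce`) are hypothetical, and the index conventions may differ from those of the paper.

```python
from bisect import bisect_left, bisect_right


def suffix_array(t):
    """Naive suffix array: starting positions of the suffixes of t in sorted order."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def lcp_array(t, A):
    """H[i] = longest common prefix of the suffixes at A[i-1] and A[i]; H[0] = 0."""
    def lcp(a, b):
        n = 0
        while a + n < len(t) and b + n < len(t) and t[a + n] == t[b + n]:
            n += 1
        return n
    return [0] + [lcp(A[i - 1], A[i]) for i in range(1, len(A))]


def neighbours(M, i):
    """Predecessor and successor of rank i among the sorted marked ranks M (i excluded)."""
    lo = bisect_left(M, i)
    hi = bisect_right(M, i)
    pred = M[lo - 1] if lo > 0 else None
    succ = M[hi] if hi < len(M) else None
    return pred, succ


def lce(H, i, j):
    """Longest common extension of the suffixes at ranks i != j: min of H over (lo, hi]."""
    lo, hi = sorted((i, j))
    return min(H[lo + 1:hi + 1])


if __name__ == "__main__":
    t = "aaba"           # stands in for a reversed text
    A = suffix_array(t)
    H = lcp_array(t, A)
    M = [0, 2]           # ranks assumed to be marked as phrase endings
    pred, succ = neighbours(M, 1)
    print(A, H, pred, succ, lce(H, 1, pred), lce(H, 1, succ))
```

Querying rank 1 against the marked ranks 0 and 2 yields one candidate on each side, and `lce` reports how many characters the query suffix shares with each candidate. This corresponds to the number of characters that could be copied from the respective source phrase in the caption's description.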

Theorems & Definitions (4)

Example 1
Lemma 1
Example 2
Example 3
