Sketching and Streaming for Dictionary Compression

Ruben Becker; Matteo Canton; Davide Cenzato; Sung-Hwan Kim; Bojana Kodric; Nicola Prezza

TL;DR

The paper studies how to estimate the output size of dictionary compressors in sub-linear space by approximating the normalized substring complexity $\delta$. It introduces a data sketch that combines sampled substring lengths, Rabin fingerprints, and a count-distinct sketch, and that yields a $(1\pm\varepsilon)$-approximation of $\delta$ with high probability using $O(\varepsilon^{-3}\log n + \varepsilon^{-1}\log^2 n)$ words of space. Sketches are mergeable: the sketches of two strings can be combined into the sketch of the pair. A streaming version computes the sketch in $O(\sqrt{n}\log n)$ working space with $O(\varepsilon^{-1}\log^2 n)$ worst-case delay per character, which makes very large data streams practical to process. The sketch supports a $\delta$-based Normalized Compression Distance, $\mathrm{NCD}_\delta$, speeds up all-pairs distance computations by one order of magnitude, and is evaluated on genomic and phylogenetic datasets.
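
For reference, the quantity being approximated can be computed exactly on short strings. The following is a minimal sketch, assuming the standard definition $\delta(S) = \max_k d_k(S)/k$, where $d_k(S)$ is the number of distinct length-$k$ substrings of $S$. The function names are illustrative and not taken from the paper, and the quadratic time and space of this brute-force version are exactly what the sketching approach avoids.

```python
def substring_complexity(s):
    """Return [d_1, ..., d_n], where d_k counts the distinct length-k substrings of s."""
    n = len(s)
    return [len({s[i:i + k] for i in range(n - k + 1)}) for k in range(1, n + 1)]


def delta(s):
    """Exact normalized substring complexity: max over k of d_k / k (s must be non-empty)."""
    return max(d / k for k, d in enumerate(substring_complexity(s), start=1))


if __name__ == "__main__":
    for text in ("aaaa", "abab", "abcd"):
        print(text, substring_complexity(text), delta(text))
```

A highly repetitive string such as `"aaaa"` has one distinct substring of each length, so its $\delta$ stays at 1, while a string of distinct characters attains its maximum at $k=1$.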

Abstract

We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv '77, the run-length Burrows-Wheeler transform, and grammar compression. To this end, we focus on a measure that has recently gained much attention in the information-theoretic community and which approximates up to a polylogarithmic multiplicative factor the output sizes of those compressors: the normalized substring complexity function $\delta$. We present a data sketch of $O(\varepsilon^{-3}\log n + \varepsilon^{-1}\log^2 n)$ words that allows computing a multiplicative $(1\pm\varepsilon)$-approximation of $\delta$ with high probability, where $n$ is the string length. The sketches of two strings $S_1,S_2$ can be merged in $O(\varepsilon^{-1}\log^2 n)$ time to yield the sketch of $\{S_1,S_2\}$, speeding up by orders of magnitude tasks such as the computation of all-pairs Normalized Compression Distances (NCD). If random access is available on the input, our sketch can be updated in $O(\varepsilon^{-1}\log^2 n)$ time for each character right-extension of the string. This yields a polylogarithmic-space algorithm for approximating $\delta$, improving exponentially over the working space of the state-of-the-art algorithms running in nearly-linear time. Motivated by the fact that random access is not always available on the input data, we then present a streaming algorithm computing our sketch in $O(\sqrt{n} \cdot \log n)$ working space and $O(\varepsilon^{-1}\log^2 n)$ worst-case delay per character. We show that an implementation of our streaming algorithm can estimate $\delta$ on a dataset of 189GB with a throughput of 203MB per minute while using only 5MB of RAM, and that our sketch speeds up the computation of all-pairs NCD distances by one order of magnitude, with applications to phylogenetic tree reconstruction.
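
The mergeable sketch described above can be illustrated with a small toy version. This is a sketch under stated assumptions, not the paper's construction: it assumes geometric sampling of the lengths $k$ with ratio $1+\varepsilon$, uses a k-minimum-values (KMV) sketch as a stand-in for the unspecified count-distinct sketch, fixes the hash constants for reproducibility (they would be random in practice), makes one pass per sampled length instead of streaming, and takes $\mathrm{NCD}_\delta$ to follow the usual NCD formula with $\delta$ in place of the compressor. All names (`KMV`, `sketch`, `merge`, `ncd_delta`) are hypothetical.

```python
import math

P = (1 << 61) - 1                      # Mersenne prime modulus
BASE = 1_000_003                       # fingerprint base (assumed; random in practice)
A, B = 1_234_567_890_123_456_789, 987_654_321_987_654_321  # affine mixer (assumed)


def fingerprints(s, k):
    """Yield the Rabin fingerprint of every length-k window of s."""
    if k > len(s):
        return
    top = pow(BASE, k - 1, P)
    f = 0
    for c in s[:k]:
        f = (f * BASE + ord(c)) % P
    yield f
    for i in range(k, len(s)):
        # Drop the leftmost character, shift, append the new one.
        f = ((f - ord(s[i - k]) * top) * BASE + ord(s[i])) % P
        yield f


class KMV:
    """Count-distinct sketch that keeps the m smallest distinct hash values."""

    def __init__(self, m, vals=()):
        self.m = m
        self.vals = set(sorted(vals)[:m])

    def add(self, h):
        if h in self.vals:
            return
        if len(self.vals) < self.m:
            self.vals.add(h)
        else:
            largest = max(self.vals)   # O(m); a heap would be faster
            if h < largest:
                self.vals.remove(largest)
                self.vals.add(h)

    def estimate(self):
        if len(self.vals) < self.m:    # fewer than m distinct hashes seen: exact count
            return len(self.vals)
        return (self.m - 1) * P / max(self.vals)

    def merge(self, other):
        return KMV(self.m, self.vals | other.vals)


def sampled_lengths(n, eps):
    """Geometrically spaced substring lengths 1 <= k <= n (assumed sampling scheme)."""
    ks, k = [], 1
    while k <= n:
        ks.append(k)
        k = max(k + 1, math.ceil(k * (1 + eps)))
    return ks


def sketch(s, eps=0.5, m=64):
    """Map each sampled length k to a KMV sketch of the length-k substrings of s."""
    sk = {}
    for k in sampled_lengths(len(s), eps):
        sk[k] = KMV(m)
        for f in fingerprints(s, k):
            sk[k].add((A * f + B) % P)
    return sk


def merge(sk1, sk2):
    """Sketch of the pair {S1, S2}: merge the per-length sketches."""
    out = dict(sk1)
    for k, c in sk2.items():
        out[k] = out[k].merge(c) if k in out else c
    return out


def estimate_delta(sk):
    return max(c.estimate() / k for k, c in sk.items())


def ncd_delta(sk1, sk2):
    d1, d2 = estimate_delta(sk1), estimate_delta(sk2)
    return (estimate_delta(merge(sk1, sk2)) - min(d1, d2)) / max(d1, d2)
```

Because merging touches only the stored hash values, an all-pairs distance computation needs one sketch per string and one cheap merge per pair, rather than one compression run per pair.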

Paper Structure (15 sections, 10 theorems, 12 equations, 3 figures, 2 tables)

Introduction
Overview of the paper.
Related work.
Preliminaries
Properties of $\delta$ and $\mathrm{NCD}_\delta$
A data sketch for estimating $\delta$
Streaming algorithm
Implementation and experiments
Missing Proofs
Details on Bookmarking the RLBWT
Detailed Experimental Results
Estimation of $d_k$.
Experiments on phylogenetic tree reconstruction.
Running Time.
Funding

Key Result

Lemma 2.1

Letting $S$ be a string and $r$ be the number of equal-letter runs in $BWT(S^R)$, there exists a data structure of $O(r)$ words storing $BWT(S^R)$ supporting right-extensions of $S$ (i.e. $BWT(S^R) \rightarrow BWT((Sa)^R)$, for any $a\in \Sigma$) in $O(\log|S|)$ time. Within the same time, the struc

Figures (3)

Figure 1: Example showing how the BWT (of the reversed stream) is updated upon character right-extensions of the stream, and how the bookmark $j$ corresponding to window length $k=2$ is initialized and updated. Top left: empty stream ($S = \$$). Top right: a new character $b$ arrives on the stream ($S = \$b$): in the BWT, $\$$ is replaced by $b$ and a new $\$$ is inserted in the position corresponding to the lexicographic rank $i=2$ of the new suffix $b\$$. Position $i$ is computed in $O(\log|S|)$ time using the algorithm described in policriti2018lz77. Bottom left: a new character $a$ arrives on the stream ($S = \$ba$): in the BWT, $\$$ is replaced by $a$ and a new $\$$ is inserted in the position corresponding to the lexicographic rank $i=2$ of the new suffix $ab\$$. Since the stream length is equal to $k+1=3$, we initialize the bookmark $j \leftarrow BWT.LF(i) = BWT.LF(2) = 1$. Note that $BWT[j]=b$ indeed contains character $S[|S|-k+1] = b$. Bottom right: a new character $a$ arrives on the stream ($S = \$baa$): in the BWT, $\$$ is replaced by $a$ and a new $\$$ is inserted in the position corresponding to the lexicographic rank $i=2$ of the new suffix $aab\$$. Since $1 = j < i = 2$ ($\$$ is inserted after position $j$), $j=1$ is not modified (otherwise, it would have been incremented by 1). Finally, we update $j$ by advancing it by one position in the text: $j \leftarrow BWT.LF(j) = BWT.LF(1) = 4$. Note that $BWT[j]=a$ indeed contains character $S[|S|-k+1] = a$. Importantly, the data structure of policriti2018lz77 always uses space proportional to the number $r$ of equal-letter runs of the BWT.
Figure 2: From left to right: distribution of the error of $\tilde{\delta}$ on the P&C repetitive corpus; line plot of five repetitiveness measures (normalized to $[0,1]$) computed on increasing prefixes of para.
Figure 3: Two similar phylogenetic trees constructed using the normalized compression distance (NCD), computed with the estimated normalized substring complexity $\tilde{\delta}$ (left) and with the popular compressor $\mathtt{xz}$ (right). The normalized Robinson-Foulds distance between the two trees is 0.194.
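
The update rule walked through in the Figure 1 caption can be replayed with a naive simulation. This is an illustrative sketch only: it stores the reversed stream explicitly and finds the rank $i$ by comparing suffixes, so it uses linear space and time per character, unlike the $O(r)$-word, $O(\log|S|)$-time structure of Lemma 2.1. The class and method names are hypothetical, positions are 0-indexed in the code (the caption is 1-indexed), and all stream characters are assumed to compare greater than `$`.

```python
def lf(bwt, j):
    """LF mapping for the 0-indexed BWT position j."""
    c = bwt[j]
    smaller = sum(1 for x in bwt if x < c)   # characters smaller than c
    return smaller + bwt[:j].count(c)        # plus occurrences of c before j


class NaiveStreamBWT:
    """BWT of the reversed stream with one bookmark for window length k."""

    def __init__(self, k):
        self.k = k
        self.text = "$"      # reversed stream; always ends with the terminator
        self.bwt = ["$"]
        self.j = None        # bookmark, set once the stream length reaches k + 1

    def extend(self, a):
        """Process one new stream character; return the rank i of the new suffix."""
        new_text = a + self.text
        # Rank of the new suffix = number of existing suffixes smaller than it.
        i = sum(1 for p in range(len(self.text)) if self.text[p:] < new_text)
        self.bwt[self.bwt.index("$")] = a    # $ is replaced by the new character
        self.bwt.insert(i, "$")              # a new $ is inserted at rank i
        self.text = new_text
        if len(self.text) == self.k + 1:
            self.j = lf(self.bwt, i)         # initialize the bookmark
        elif self.j is not None:
            if i <= self.j:                  # insertion at or before j shifts it
                self.j += 1
            self.j = lf(self.bwt, self.j)    # advance by one text position
        return i


if __name__ == "__main__":
    s = NaiveStreamBWT(k=2)
    for ch in "baa":
        i = s.extend(ch)
        bookmark = None if s.j is None else s.j + 1
        print(ch, "".join(s.bwt), "i =", i + 1, "j =", bookmark)
```

Feeding the characters `b`, `a`, `a` reproduces the three non-empty panels of Figure 1: the BWT ends as `b$aa` and the bookmark ends at 1-indexed position 4, which holds `a`.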

Theorems & Definitions (15)

Lemma 2.1: policriti2018lz77, Thm. 2
Lemma 3.0
Corollary 3.1
Lemma 3.1
Definition 4.1: Sketch for $\delta$
Lemma 4.1
Theorem 4.2
Theorem 5.1
proof
Lemma A.2
...and 5 more
