Sketching and Streaming for Dictionary Compression
Ruben Becker, Matteo Canton, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Nicola Prezza
TL;DR
The paper studies how to estimate the output size of dictionary compressors in sub-linear space, using the normalized substring complexity $\delta$ as a proxy. It introduces a data sketch built from sampled substring lengths, Rabin fingerprints, and a count-distinct sketch, which gives a $(1\pm\varepsilon)$-approximation of $\delta$ with high probability in $O(\varepsilon^{-3}\log n + \varepsilon^{-1}\log^2 n)$ words of space; the sketches of two strings can be merged in $O(\varepsilon^{-1}\log^2 n)$ time. A streaming version computes the sketch in $O(\sqrt{n}\log n)$ working space with $O(\varepsilon^{-1}\log^2 n)$ worst-case delay per character, and an implementation processes a 189GB dataset at 203MB per minute using 5MB of RAM. The sketch also supports a $\delta$-based Normalized Compression Distance, $\mathrm{NCD}_\delta$, speeds up all-pairs NCD computation by one order of magnitude, and is applied to phylogenetic tree reconstruction.
Abstract
We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv '77, the run-length Burrows-Wheeler transform, and grammar compression. To this end, we focus on a measure that has recently gained much attention in the information-theoretic community and which approximates up to a polylogarithmic multiplicative factor the output sizes of those compressors: the normalized substring complexity function $δ$. We present a data sketch of $O(ε^{-3}\log n + ε^{-1}\log^2 n)$ words that allows computing a multiplicative $(1\pm ε)$-approximation of $δ$ with high probability, where $n$ is the string length. The sketches of two strings $S_1,S_2$ can be merged in $O(ε^{-1}\log^2 n)$ time to yield the sketch of $\{S_1,S_2\}$, speeding up by orders of magnitude tasks such as the computation of all-pairs Normalized Compression Distances (NCD). If random access is available on the input, our sketch can be updated in $O(ε^{-1}\log^2 n)$ time for each character right-extension of the string. This yields a polylogarithmic-space algorithm for approximating $δ$, improving exponentially over the working space of the state-of-the-art algorithms running in nearly-linear time. Motivated by the fact that random access is not always available on the input data, we then present a streaming algorithm computing our sketch in $O(\sqrt n \cdot \log n)$ working space and $O(ε^{-1}\log^2 n)$ worst-case delay per character. We show that an implementation of our streaming algorithm can estimate $δ$ on a dataset of 189GB with a throughput of 203MB per minute while using only 5MB of RAM, and that our sketch speeds up the computation of all-pairs NCD distances by one order of magnitude, with applications to phylogenetic tree reconstruction.
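To make the quantity being approximated concrete, the following minimal sketch computes $δ$ exactly by brute force, using the standard definition $δ = \max_k d_k/k$, where $d_k$ is the number of distinct substrings of length $k$. It is not the sketching or streaming algorithm of the paper: it uses linear space and quadratic or worse time, and is meant only for small non-empty inputs. The function names `substring_complexity` and `ncd_delta` are hypothetical, and the distance shown is an assumption: it plugs $δ$ into the classic NCD formula in place of a compressed size, with $δ$ of the pair $\{x, y\}$ counting substrings that occur in either string.

```python
def substring_complexity(strings):
    """Exact delta = max_k d_k / k by brute force (illustration only).

    `strings` is a single string or a list of strings; for a list, d_k counts
    the distinct length-k substrings occurring in at least one of them.
    """
    if isinstance(strings, str):
        strings = [strings]
    longest = max(len(s) for s in strings)
    best = 0.0
    for k in range(1, longest + 1):
        # Set of all distinct substrings of length k across the input strings.
        distinct = {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}
        best = max(best, len(distinct) / k)
    return best


def ncd_delta(x, y):
    """Assumed delta-based analogue of NCD: (d(x,y) - min) / max.

    Both inputs must be non-empty, otherwise the denominator is zero.
    """
    dx = substring_complexity(x)
    dy = substring_complexity(y)
    dxy = substring_complexity([x, y])
    return (dxy - min(dx, dy)) / max(dx, dy)


if __name__ == "__main__":
    print(substring_complexity("abab"))   # two distinct characters dominate
    print(ncd_delta("abab", "abab"))      # identical strings
    print(ncd_delta("aaaa", "bbbb"))      # strings sharing no substring
```

Because the set of length-$k$ substrings of $\{x, y\}$ is the union of the two per-string sets, a count-distinct summary per sampled length can be combined without revisiting the inputs, which is the property the mergeable sketch described above relies on.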
