Efficient terabyte-scale text compression via stable local consistency and parallel grammar processing

Diego Diaz-Dominguez

TL;DR

This paper addresses the challenge of compressing terabyte-scale text collections efficiently by introducing stable locally consistent parsing, enabling independent parallel grammar construction and merging. The core approach builds and merges locally consistent grammars via BuildGram and MergeGrams, with a parallel variant PBuildGram; BuildGram runs in $O(n)$ time w.h.p. and uses $O(G\log G)$ bits of working space, while MergeGrams runs in $O(G_a+G_b)$ time with $O(G_a\log g_a + G_b\log g_b)$ bits. The authors demonstrate practical scalability by processing 7.9 TB of bacterial genomes in about nine hours on 16 threads, achieving about 85x compression with modest memory ($0.43$ bits/symbol). The results suggest that stable local consistency and parallel grammar processing offer a viable path to scalable compression and fast downstream processing on terabyte-scale data.

Abstract

We present a highly parallelisable text compression algorithm that scales efficiently to terabyte-sized datasets. Our method builds on locally consistent grammars, a lightweight form of compression, combined with simple recompression techniques to achieve further space reductions. Locally consistent grammar algorithms are particularly suitable for scaling, as they need minimal satellite information to compact the text. We introduce a novel concept to enable parallelisation: stable local consistency. A grammar algorithm ALG is stable if, for any pattern $P$ occurring in a collection $\mathcal{T}=\{T_1, T_2, \ldots, T_k\}$, the instances $ALG(T_1), ALG(T_2), \ldots, ALG(T_k)$ independently produce cores for $P$ with the same topology. In a locally consistent grammar, the core of $P$ is a subset of nodes and edges in $\mathcal{T}$'s parse tree that remains the same in all the occurrences of $P$. This feature is important to achieve compression, but it only holds if ALG synchronises the parsing of the strings, for instance, by defining a common set of nonterminal symbols for them. Stability removes the need for synchronisation during the parsing phase. Consequently, we can run $ALG(T_1), ALG(T_2), \ldots, ALG(T_k)$ fully in parallel and then merge the resulting grammars into a single compressed output equivalent to $ALG(\mathcal{T})$. We implemented our ideas and tested them on massive datasets. Our results showed that our method could process a diverse collection of bacterial genomes (7.9 TB) in around nine hours, requiring 16 threads and 0.43 bits/symbol of working memory, producing a compressed representation 85 times smaller than the original input.
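
The stability idea in the abstract can be illustrated with a small sketch. It is not the paper's BuildGram: it assumes a simple stand-in rule in which a phrase break is placed wherever a symbol's fingerprint is a strict local minimum, with the fingerprint coming from a fixed hash function that does not depend on the string being parsed. The names `fingerprint`, `break_positions`, `phrases`, and `relative_breaks` are hypothetical. Because each break depends only on three adjacent symbols, two strings parsed independently place the same breaks inside a shared pattern, except possibly at its borders.

```python
import hashlib


def fingerprint(symbol: str, seed: int = 0) -> int:
    """Deterministic 64-bit fingerprint that depends only on (seed, symbol)."""
    data = f"{seed}:{symbol}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def break_positions(text: str, seed: int = 0) -> list[int]:
    """Interior positions whose fingerprint is a strict local minimum.

    Assumption: this local-minimum rule is a stand-in for the paper's
    parsing rule, chosen only because it is locally consistent.
    """
    fps = [fingerprint(c, seed) for c in text]
    return [i for i in range(1, len(text) - 1) if fps[i - 1] > fps[i] < fps[i + 1]]


def phrases(text: str, seed: int = 0) -> list[str]:
    """Cut the text at every break position."""
    cuts = [0, *break_positions(text, seed), len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


def relative_breaks(text: str, offset: int, length: int, seed: int = 0) -> set[int]:
    """Breaks strictly inside the pattern occurrence text[offset:offset+length],
    reported relative to the start of the occurrence."""
    return {
        i - offset
        for i in break_positions(text, seed)
        if offset < i < offset + length - 1
    }


pattern = "agtagtagtgtagtaggagatcggag"  # input string of Figure 2
t1 = "ccc" + pattern + "tt"             # hypothetical contexts around the pattern
t2 = "gattaca" + pattern + "a"

# Each string is parsed on its own, with no shared state besides the hash seed.
same = relative_breaks(t1, 3, len(pattern)) == relative_breaks(t2, 7, len(pattern))
print(same)  # True: interior breaks only depend on symbols inside the pattern
```

Only the fixed hash function is shared between the two parses, which is the property that lets the strings be processed in parallel and merged afterwards.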

Paper Structure

This paper contains 27 sections, 2 theorems, 6 equations, 6 figures, 2 tables.

Key Result

Theorem 2

Let $\mathcal{T}$ be a collection of $k$ strings and $||\mathcal{T}||=n$ symbols, where the longest string has length $n_{max}$. Additionally, let $\mathcal{H}$ be a set of hash functions with $|\mathcal{H}| \geq \lceil \log n_{max} \rceil + 1$ elements. $\textsc{BuildGram}(\mathcal{T}, \mathcal{H})

Figures (6)

  • Figure 1: Locally consistent grammar compression of $P[1..33]$. The first row (bottom-up) is $P$, and the next rows are the metasymbols for parsing rounds. The grey boxes are breaks. The boxes below the thick black line are the core of $P$. The $A^{i}$ and $Z^{i}$ change if the context of $P$ changes.
  • Figure 2: Example of BuildGram with the input string agtagtagtgtagtaggagatcggag and the hash functions $\mathcal{H}=\{h^{0}, h^{1}, h^{2}, h^{3}\}$. The grey boxes indicate the breaks induced by $\mathcal{H}$.
  • Figure 3: Example of $\textsc{MergeGrams}(\mathcal{G}_a, \mathcal{G}_b)$. As the parsing is stable, $T_a[1..19]=T_b[1..19]$ have cores (dashed boxes) with the same topology in $\mathcal{G}_a$ and $\mathcal{G}_b$. In (B-C), the nonterminals represent the relative position of their rules in their corresponding levels. For example, the left-hand side of $2 \rightarrow 5 2 4$ in $\mathcal{G}_b$ (side B) is $2$ because that rule is the second in level 2. On the other hand, the symbol $2$ in $5 2 4$ refers to the second rule of level 1. In the first merge round, MergeGrams checks which right-hand sides in level one of $\mathcal{G}_b$ are also right-hand sides in level one of $\mathcal{G}_a$ (dashed lines in side B). Only tc is not in $\mathcal{G}_a$, so the algorithm appends it at the end of level one in $\mathcal{G}_a$ and assigns it the new metasymbol $6$ (side C). Subsequently, it discards level one in $\mathcal{G}_b$ and updates the right-hand sides of level two in $\mathcal{G}_b$ according to their corresponding metasymbols in $\mathcal{G}_a$. In (C), the rule $1 \rightarrow 1 1 4 3 1$ becomes $1 \rightarrow 2 2 3 4 2$ and $2 \rightarrow 5 2 4$ becomes $2 \rightarrow 1 6 3$. For example, $5$ becomes $1$ on the right-hand side of $2 \rightarrow 5 2 4$ because the level one rule $5 \rightarrow \texttt{agg}$ in $\mathcal{G}_b$ matches $1 \rightarrow \texttt{agg}$ in $\mathcal{G}_a$ (see dashed lines in side B). After the update, MergeGrams goes to the next round and operates recursively.
  • Figure 4: Performance of LCG. (A) Running time breakdown. (B) Memory peak breakdown. "Fps" are the fingerprints in PBuildGram, and "Sat. data" are arrays and grammars of the buffers (Section \ref{sec:gram_enc}). (C) Performance of LCG in HUM relative to the number of threads. The left y-axis is the compression speed and the right y-axis is the memory peak.
  • Figure 5: Schematic representation of PBuildGram. The steps (a-e) indicate the cycle of a buffer during the compression step.
  • ...and 1 more figure
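
The merge round described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the paper's MergeGrams implementation. It assumes each grammar is a list of levels, each level is a list of right-hand sides, and a nonterminal is the 1-based position of its rule within its level. The function names are hypothetical. Only `agg`, `tc`, and the two level-two rules come from the caption; the other level-one strings are made-up placeholders chosen so that the remapping matches the caption.

```python
def merge_level(level_a: list, level_b: list) -> dict[int, int]:
    """Merge level_b into level_a in place.

    Returns a map from each rule position in level_b to the position of
    the same right-hand side in the (possibly extended) level_a.
    """
    index = {rhs: pos for pos, rhs in enumerate(level_a, start=1)}
    remap = {}
    for pos_b, rhs in enumerate(level_b, start=1):
        if rhs not in index:          # right-hand side unseen in G_a: append it
            level_a.append(rhs)
            index[rhs] = len(level_a)
        remap[pos_b] = index[rhs]
    return remap


def rewrite_level(level: list, remap: dict[int, int]) -> list:
    """Rename the symbols of the next level of G_b to G_a's metasymbols."""
    return [tuple(remap[sym] for sym in rhs) for rhs in level]


def merge_grams(gram_a: list, gram_b: list) -> list:
    """Merge gram_b into gram_a level by level (gram_a is modified in place)."""
    remap = None
    for depth, level_b in enumerate(gram_b):
        if remap is not None:
            level_b = rewrite_level(level_b, remap)
        if depth == len(gram_a):      # G_b is taller than G_a
            gram_a.append([])
        remap = merge_level(gram_a[depth], level_b)
    return gram_a


# Level one: "agg" and "tc" are from the caption, the rest are placeholders.
a_level1 = ["agg", "gta", "cgg", "tag", "aga"]
b_level1 = ["gta", "tc", "tag", "cgg", "agg"]
b_level2 = [(1, 1, 4, 3, 1), (5, 2, 4)]   # rules 1 and 2 of level two in G_b

first_round = merge_level(a_level1, b_level1)
print(a_level1[5])                           # tc, appended as metasymbol 6
print(rewrite_level(b_level2, first_round))  # [(2, 2, 3, 4, 2), (1, 6, 3)]
```

The dictionary lookup stands in for whatever membership structure the real algorithm uses; the sketch only mirrors the renaming logic, not the stated time or space bounds.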

Theorems & Definitions (3)

  • Definition 1
  • Theorem 2
  • Theorem 3