Efficient terabyte-scale text compression via stable local consistency and parallel grammar processing

Diego Diaz-Dominguez

TL;DR

This paper addresses the challenge of compressing terabyte-scale text collections efficiently by introducing stable locally consistent parsing, enabling independent parallel grammar construction and merging. The core approach builds and merges locally consistent grammars via BuildGram and MergeGrams, with a parallel variant PBuildGram; BuildGram runs in $O(n)$ time w.h.p. and uses $O(G\log G)$ bits of working space, while MergeGrams runs in $O(G_a+G_b)$ time with $O(G_a\log g_a + G_b\log g_b)$ bits. The authors demonstrate practical scalability by processing 7.9 TB of bacterial genomes in about nine hours on 16 threads, achieving about 85x compression with modest memory ($0.43$ bits/symbol). The results suggest that stable local consistency and parallel grammar processing offer a viable path to scalable compression and fast downstream processing on terabyte-scale data.
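To put the reported figures on a common scale, the following back-of-the-envelope sketch converts them into absolute sizes. It assumes decimal terabytes and one byte per input symbol; neither assumption is stated in the summary above, so the outputs are rough estimates rather than reported results.

```python
# Rough conversion of the reported figures into absolute sizes.
# Assumptions (not from the paper): 1 TB = 1e12 bytes, 1 symbol = 1 byte.
INPUT_BYTES = 7.9e12              # reported collection size
WORKING_BITS_PER_SYMBOL = 0.43    # reported working memory
COMPRESSION_RATIO = 85            # reported compression ratio

symbols = INPUT_BYTES             # one byte per symbol (assumption)

# Working memory in gigabytes: bits/symbol * symbols / 8 bits per byte.
working_memory_gb = symbols * WORKING_BITS_PER_SYMBOL / 8 / 1e9

# Size of the compressed output in gigabytes.
compressed_gb = INPUT_BYTES / COMPRESSION_RATIO / 1e9

# Output cost per input symbol, in bits.
output_bits_per_symbol = 8 / COMPRESSION_RATIO

if __name__ == "__main__":
    print(f"working memory ~ {working_memory_gb:.0f} GB")
    print(f"compressed size ~ {compressed_gb:.0f} GB")
    print(f"output ~ {output_bits_per_symbol:.3f} bits/symbol")
```

Under these assumptions the working memory comes to roughly 425 GB and the compressed output to roughly 93 GB.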

Abstract

We present a highly parallelizable text compression algorithm that scales efficiently to terabyte-sized datasets. Our method builds on locally consistent grammars, a lightweight form of compression, combined with simple recompression techniques to achieve further space reductions. Locally consistent grammar algorithms are particularly suitable for scaling, as they need minimal satellite information to compact the text. To enable parallelisation, we introduce a novel concept: stable local consistency. A grammar algorithm ALG is stable if, for any pattern $P$ occurring in a collection $\mathcal{T}=\{T_1, T_2, \ldots, T_k\}$, the instances $ALG(T_1), ALG(T_2), \ldots, ALG(T_k)$ independently produce cores for $P$ with the same topology. In a locally consistent grammar, the core of $P$ is a subset of nodes and edges in $\mathcal{T}$'s parse tree that remains the same in all the occurrences of $P$. This feature is important to achieve compression, but it only holds if ALG synchronises the parsing of the strings, for instance, by defining a common set of nonterminal symbols for them. Stability removes the need for synchronisation during the parsing phase. Consequently, we can run $ALG(T_1), ALG(T_2), \ldots, ALG(T_k)$ fully in parallel and then merge the resulting grammars into a single compressed output equivalent to $ALG(\mathcal{T})$. We implemented our ideas and tested them on massive datasets. Our results showed that our method could process a diverse collection of bacterial genomes (7.9 TB) in around nine hours, using 16 threads and 0.43 bits/symbol of working memory, and producing a compressed representation 85 times smaller than the original input.
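The idea of parsing strings independently and merging afterwards can be illustrated with a toy, single-round parser. This is a minimal sketch under stated assumptions, not the paper's BuildGram or MergeGrams: the names `fingerprint`, `breaks`, `toy_build`, and `toy_merge` are hypothetical, phrase boundaries are placed at strict local minima of per-symbol fingerprints, and each phrase is named by a fingerprint of its content instead of a sequentially assigned nonterminal. Because both choices depend only on local content, two runs that never communicate place the same boundaries inside every occurrence of a shared pattern, and merging reduces to a dictionary union.

```python
import hashlib


def fingerprint(s: str) -> int:
    """Deterministic 64-bit fingerprint, identical in every independent run."""
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def breaks(text: str) -> list[int]:
    """Phrase starts: position 0 plus every strict local minimum of the
    per-symbol fingerprints (a decision that looks at 3 symbols only)."""
    f = [fingerprint(c) for c in text]
    return [0] + [i for i in range(1, len(text) - 1) if f[i - 1] > f[i] < f[i + 1]]


def toy_build(text: str) -> tuple[dict[int, str], list[int]]:
    """One parsing round. Returns (rules, parse), where rules maps a
    content-derived phrase name to the phrase and parse is the name sequence."""
    if not text:
        return {}, []
    starts = breaks(text) + [len(text)]
    rules: dict[int, str] = {}
    parse: list[int] = []
    for a, b in zip(starts, starts[1:]):
        phrase = text[a:b]
        name = fingerprint(phrase)  # no shared counter, hence no synchronisation
        rules[name] = phrase
        parse.append(name)
    return rules, parse


def toy_merge(ga: dict[int, str], gb: dict[int, str]) -> dict[int, str]:
    """Merge two independently built rule sets; equal phrases share a name."""
    merged = dict(ga)
    merged.update(gb)
    return merged


if __name__ == "__main__":
    pattern = "GATTACAGATTTCAGGACCA"
    t1 = "CCCC" + pattern + "TTGA"
    t2 = "AG" + pattern + "CATG"
    (r1, p1), (r2, p2) = toy_build(t1), toy_build(t2)
    merged = toy_merge(r1, r2)
    print(len(r1), len(r2), len(merged))
    print("".join(merged[x] for x in p1) == t1)
```

In this toy, any phrase that lies between two boundaries inside the shared pattern receives the same name in both runs, which is what lets the merge step deduplicate it. A 64-bit fingerprint can collide in principle; a real implementation would need to handle that, and this sketch does not.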
