String Partition for Building Long Burrows-Wheeler Transforms
Enno Adler, Stefan Böttcher, Rita Hartel
TL;DR
The work tackles the scalability of Burrows-Wheeler Transform (BWT) construction for long strings by partitioning the input string $S$ into a collection $W$ of shorter words, using a prefix of the suffix array, $PSA(S)$, and an Omega-based partitioning rule. Multi-string BWT construction algorithms can then be applied to the partitions; partDNA provides a concrete DNA-focused implementation and, combined with IBB, offers a favorable time–memory trade-off. $BWT(S)$ can still be recovered from $BWT(W)$ after removing end-marker runs. The Partition Theorem formalizes how to derive $W$ from $S$ via $W_i = S[\,\text{Omega}(i), \text{PSA}(S)[i]-1]$ and how $SA(W)$ and $DA(W)$ relate to $SA(S)$ through the $word$ and $position$ mappings. Experiments show that partDNA+IBB reaches a competitive Pareto-optimal point among state-of-the-art BWT construction methods on real genomes. Overall, the method generalizes to alphabets beyond DNA and offers a scalable route to long-string indexing by reducing a single long-string BWT problem to a multi-string one, with provable correctness and favorable practical performance.
Abstract
Constructing the Burrows-Wheeler transform (BWT) for long strings poses significant challenges regarding construction time and memory usage. We use a prefix of the suffix array to partition a long string into shorter substrings, thereby enabling the use of multi-string BWT construction algorithms to process these partitions quickly. We provide an implementation, partDNA, for DNA sequences. Through comparison with state-of-the-art BWT construction algorithms, we show that partDNA with IBB offers a novel trade-off between construction time and memory usage on real genome datasets. Beyond this, the proposed partitioning strategy is applicable to strings over any alphabet.
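For readers less familiar with the objects named above, the following is a minimal sketch of how a suffix array and the BWT of a single string relate. It is a naive, quadratic-memory illustration only, not partDNA or IBB, and it does not implement the partitioning rule. The function names `suffix_array` and `bwt` are hypothetical, and the end marker `$` is assumed to be absent from the input and to sort before every other symbol.

```python
def suffix_array(s: str) -> list[int]:
    """Naive suffix array: start positions of all suffixes of s,
    sorted by the lexicographic order of the suffixes."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def bwt(s: str, end_marker: str = "$") -> str:
    """BWT of s via its suffix array (illustrative assumption:
    end_marker does not occur in s and is smaller than all symbols of s)."""
    if end_marker in s:
        raise ValueError("end marker must not occur in the input")
    t = s + end_marker
    sa = suffix_array(t)
    # The BWT symbol for a suffix starting at i is the symbol preceding it;
    # for i == 0, t[-1] wraps around to the end marker.
    return "".join(t[i - 1] for i in sa)


if __name__ == "__main__":
    print(suffix_array("GATTACA$"))  # [7, 6, 4, 1, 5, 0, 3, 2]
    print(bwt("GATTACA"))            # ACTGA$TA
```

Sorting full suffixes in this way is what becomes impractical for genome-scale inputs, which is the setting that motivates splitting the string and handing the pieces to a multi-string BWT construction algorithm.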
