String Partition for Building Long Burrows-Wheeler Transforms

Enno Adler; Stefan Böttcher; Rita Hartel

TL;DR

The work tackles the scalability of Burrows-Wheeler Transform construction for long strings by partitioning the input string $S$ into a collection $W$ of shorter words, using a prefix of the suffix array, $PSA(S)$, and an Omega-based partitioning rule. Multi-string BWT construction algorithms can then be applied to the partition. partDNA is a concrete implementation for DNA sequences; combined with IBB, it offers a favorable time-memory trade-off. $BWT(S)$ can still be recovered from $BWT(W)$ by removing the run of end markers. The Partition Theorem formalizes how to derive $W$ from $S$ via $W_i = S[Omega(i),\, PSA(S)[i]-1]$ and how $SA(W)$ and $DA(W)$ relate to $SA(S)$ through the $word$ and $position$ mappings. The experimental results show that partDNA+IBB reaches a competitive Pareto-optimal point among state-of-the-art BWT construction methods on real genomes. The partitioning strategy also applies to alphabets beyond DNA: it reduces a single long-string BWT problem to a multi-string problem with provable correctness.

Abstract

Constructing the Burrows-Wheeler transform (BWT) for long strings poses significant challenges regarding construction time and memory usage. We use a prefix of the suffix array to partition a long string into shorter substrings, thereby enabling the use of multi-string BWT construction algorithms to process these partitions fast. We provide an implementation, partDNA, for DNA sequences. Through comparison with state-of-the-art BWT construction algorithms, we show that partDNA with IBB offers a novel trade-off for construction time and memory usage for BWT construction on real genome datasets. Beyond this, the proposed partitioning strategy is applicable to strings of any alphabet.

Paper Structure (11 sections, 7 theorems, 24 equations, 5 figures, 1 table)

Introduction
Related Work
Preliminaries
Partition Theorem
Partition DNA Sequences: partDNA
Runtime Complexity
Experimental Results
Conclusion
Proof of the Partition Theorem
On the Size of the Reduced Problem compared to SA-IS
List of Symbols

Key Result

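Theorem (Partition Theorem)

Let $l (= k + 1)$ be the size of $W$ and $m (= n + l)$ be the total length of $BWT(W)$. Then, for all $i < m$, $SA(W)[i]$ and $DA(W)[i]$ are determined by $SA(S)$ through the $word$ and $position$ mappings. Thus, if we know $BWT(W)$, we get $BWT(S)$ by removing the $\#$-run.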
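The statement above can be checked on a toy input. The following is a minimal sketch, not the partDNA implementation: it uses a naive suffix sort, and all function names (`suffix_array`, `bwt`, `partition`, `multi_bwt`, `recover`) are hypothetical. It also makes two simplifying assumptions that differ from the paper's setup: $S$ ends with a unique smallest terminator `$` and its BWT is taken cyclically, and the piece before the first cut is appended to the final piece (wrap-around), which yields $k$ words rather than the $k + 1$ words of the theorem. Each word gets an end marker `#` whose rank equals the suffix-array rank of the suffix of $S$ that follows the word.

```python
def suffix_array(s):
    # Naive construction by sorting suffixes; only suitable for toy inputs.
    return sorted(range(len(s)), key=lambda i: s[i:])


def bwt(s):
    # Cyclic BWT of s; s must end with a unique '$' smaller than all other symbols.
    return "".join(s[i - 1] for i in suffix_array(s))


def partition(s, k):
    # Cut s at the start positions of its k smallest suffixes (a prefix of SA).
    psa = suffix_array(s)[:k]
    cuts = sorted(psa)
    rank = {p: r for r, p in enumerate(psa)}
    words = [None] * k
    # The word that ends just before cut b is stored at index rank[b].
    for a, b in zip(cuts, cuts[1:]):
        words[rank[b]] = s[a:b]
    # Assumption of this sketch: the last piece wraps around to the first cut.
    words[rank[cuts[0]]] = s[cuts[-1]:] + s[:cuts[0]]
    return words


def multi_bwt(words):
    # BWT of a collection; word i carries end marker #_i, with
    # #_0 < #_1 < ... and every marker smaller than any other symbol.
    k = len(words)
    rows = []
    for i, w in enumerate(words):
        for j in range(len(w) + 1):
            key = tuple(ord(c) for c in w[j:]) + (i - k,)  # negative = marker
            prev = w[j - 1] if j > 0 else "#"
            rows.append((key, prev))
    rows.sort()
    return "".join(prev for _, prev in rows)


def recover(bwt_w, k):
    # In this sketch the '#'-run occupies positions k .. 2k-1.
    assert bwt_w[k:2 * k] == "#" * k
    return bwt_w[:k] + bwt_w[2 * k:]


s = "banana$"
words = partition(s, 3)           # ['a', 'an', '$ban']
bwt_w = multi_bwt(words)          # 'ann###b$aa'
print(recover(bwt_w, 3), bwt(s))  # both are 'annb$aa'
```

Because the cut positions are exactly the starts of the $k$ smallest suffixes, comparing two word suffixes by their symbols and then by marker rank gives the same order as comparing the corresponding suffixes of $S$, which is why removing the `#`-run is sufficient in this toy setting.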

Figures (5)

Figure 1: Summary of the relationship of single-string and multi-string BWT construction algorithms and the contribution of this paper. The $BWT$ for $S$ can also be obtained by partitioning $S$, using a multi-string construction algorithm on the sorted partition, and removing the $\#$-run at the end. The output of the multi-string BWT construction algorithm is equal to the BWT for $W'$.
Figure 2: Concept of partitioning $S$ using $PSA(S)$ with $k = 8$ to obtain the collection $W$. The suffix of a word $W_i$ in $S$ can be expressed either by characters of $S$ or by words from $W$ in the order of their appearance in $S$.
Figure 3: Example calculation of partitioning using $h = 3$ on the word $S$. In Steps 2 and 3, we only reorder the array ID; the reordering of the columns of the other elements is shown for illustration only. Steps 2 and 3 are done in place, so there is no action needed to go from (4.a) and (2.c) to (4.b).
Figure 4: The buckets and the steps of induced suffix sorting that start and end inside the displayed interval. The name of a bucket is a shared prefix of the suffixes. For the $A^v$ buckets, it is required that the next symbol is not an $A$; otherwise, the $AAA$ and $AAAA$ buckets would also belong to the $AA$ bucket. We induce the positions from the $\$$ bucket to the next bucket on the right and reduce the suffix array entry by 1 if and only if the symbol at the position before the entry is an $A$, which is the symbol in the BWT row. In the same way, we can fill the $A^v$ buckets right to left.
Figure 5: BWT construction times and maximum resident set sizes (max-rss). Grey polygons in the scatter plots belong to a partitioned dataset: the grey tone indicates the BWT construction algorithm and the number of edges indicates the parameter $h$ used, as the legends explain. Missing points mean that the construction algorithm aborted or did not create an output file.

Theorems & Definitions (13)

Theorem
Proposition
Proof
Proposition
Proof
Proposition
Proof
Theorem
Proof
Theorem
...and 3 more
