String Partition for Building Long Burrows-Wheeler Transforms

Enno Adler, Stefan Böttcher, Rita Hartel

TL;DR

The work tackles the scalability of Burrows-Wheeler Transform construction for long strings by partitioning the input string $S$ into a collection $W$ of shorter words using the prefix of the suffix array $PSA(S)$ and an Omega-based partitioning rule. This enables applying multi-string BWT construction algorithms to the partitions, with partDNA providing a concrete DNA-focused implementation and enabling a favorable time–memory trade-off when combined with IBB; the approach preserves the ability to recover $BWT(S)$ from $BWT(W)$ after removing end-marker runs. The Partition Theorem formalizes how to derive $W$ from $S$ via $W_i = S[\text{Omega}(i), PSA(S)[i]-1]$ and how $SA(W)$ and $DA(W)$ relate to $SA(S)$ through the $word$ and $position$ mappings, while the experimental results show that partDNA+IBB achieves a competitive Pareto-optimal point among state-of-the-art BWT construction methods on real genomes. Overall, the method generalizes to alphabets beyond DNA and offers a scalable route to long-string indexing by reducing a single long-string BWT problem to a multi-string setting with provable correctness and favorable practical performance.

Abstract

Constructing the Burrows-Wheeler transform (BWT) for long strings poses significant challenges regarding construction time and memory usage. We use a prefix of the suffix array to partition a long string into shorter substrings, thereby enabling the use of multi-string BWT construction algorithms to process these partitions fast. We provide an implementation, partDNA, for DNA sequences. Through comparison with state-of-the-art BWT construction algorithms, we show that partDNA with IBB offers a novel trade-off for construction time and memory usage for BWT construction on real genome datasets. Beyond this, the proposed partitioning strategy is applicable to strings of any alphabet.
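
The reduction described in the abstract can be illustrated with a brute-force sketch. It is not partDNA or IBB, and it makes no attempt at their time or memory behavior. All function names (`suffix_array`, `bwt`, `partition`, `multi_bwt`, `bwt_via_partition`) are hypothetical, and the conventions are assumptions chosen to keep the toy example self-consistent: the input already ends with a unique smallest terminator `$`, the string is treated as circular so that the word containing `$` wraps around to the start, each word's end marker is ordered by the suffix-array rank of the suffix that follows the word, and the row of a complete word is written as `#`. The counting of words here may differ from the paper's definitions.

```python
def suffix_array(t):
    """Naive suffix array: start positions of the suffixes of t in sorted order."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def bwt(t):
    """BWT of t, assuming t ends with a unique smallest terminator '$'."""
    return "".join(t[i - 1] for i in suffix_array(t))


def partition(t, k):
    """Cut the circular string t in front of each of its k smallest suffixes.

    Returns the words ordered by the rank of the suffix that follows them.
    """
    psa = suffix_array(t)[:k]              # prefix of the suffix array
    rank_of = {pos: r for r, pos in enumerate(psa)}
    cuts = sorted(psa)                     # cut positions in text order
    starts = cuts[-1:] + cuts[:-1]         # circular predecessor of each cut
    words = []
    for p, q in zip(starts, cuts):
        # The word starting at the '$' position wraps around to the front.
        word = t[p:q] if p < q else t[p:] + t[:q]
        words.append((rank_of[q], word))
    return [w for _, w in sorted(words)]


def multi_bwt(words):
    """Naive multi-string BWT; word i gets an end marker below all letters."""
    rows = []
    for i, w in enumerate(words):
        for j in range(len(w) + 1):
            # Sort key: remaining letters, then the end marker of word i.
            key = [(1, c) for c in w[j:]] + [(0, i)]
            # A complete word is preceded by its own end marker, shown as '#'.
            prev = w[j - 1] if j > 0 else "#"
            rows.append((key, prev))
    rows.sort(key=lambda row: row[0])
    return "".join(prev for _, prev in rows)


def bwt_via_partition(t, k):
    """Partition t, build the multi-string BWT, then drop the '#' symbols."""
    assert "#" not in t
    return multi_bwt(partition(t, k)).replace("#", "")


if __name__ == "__main__":
    print(partition("banana$", 3))          # ['a', 'an', '$ban']
    print(multi_bwt(partition("banana$", 3)))  # ann###b$aa
    print(bwt_via_partition("banana$", 3))  # annb$aa
```

In this toy run the `#` symbols form one contiguous run, and deleting it yields the same string as `bwt("banana$")`. The sketch sorts full suffixes directly, so it is only suitable for checking small examples.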

Paper Structure (11 sections, 7 theorems, 24 equations, 5 figures, 1 table)

Key Result

Theorem

Let $l$ $(= k + 1)$ be the size of $W$ and $m$ $(= n + l)$ be the total length of $BWT(W)$. Then, for all $i < m$: Thus, if we know $BWT(W)$, we get $BWT(S)$ by removing the $\#$-run.

Figures (5)

  • Figure 1: Summary of the relationship of single-string and multi-string BWT construction algorithms and the contribution of this paper. The $BWT$ for $S$ can also be obtained by partitioning $S$, using a multi-string construction algorithm on the sorted partition, and removing the $\#$-run at the end. The output of the multi-string BWT construction algorithm is equal to the BWT for $W'$.
  • Figure 2: Concept of partitioning $S$ using $PSA(S)$ with $k = 8$ to obtain the collection $W$. The suffix of a word $W_i$ in $S$ can be expressed either by characters of $S$ or by words from $W$ in the order of their appearance in $S$.
  • Figure 3: Example calculation of partitioning using $h = 3$ on the word $S$. In Steps 2 and 3, we only reorder the array ID; the reordering of the columns of the other elements is shown only for illustration. Steps 2 and 3 are done in place, so no action is needed to go from (4.a) and (2.c) to (4.b).
  • Figure 4: The buckets and the steps of induced suffix sorting that start and end inside the displayed interval. The name of a bucket is a shared prefix of the suffixes. For the $A^v$ buckets it is required that the next symbol is not an $A$; otherwise, the $AAA$ and $AAAA$ buckets would also belong to the $AA$ bucket. We induce the positions from the $\$$ bucket to the next bucket on the right and reduce the suffix array entry by 1 if and only if the symbol at the position before the entry is an $A$, which is the symbol in the BWT row. In the same way, we can fill the $A^v$ buckets right to left.
  • Figure 5: BWT construction times and maximum resident set sizes (max-rss). Grey polygons in the scatter plots belong to a partitioned dataset: the grey tone indicates the BWT construction algorithm and the number of edges indicates the parameter $h$ used, as the legends explain. Missing points mean that the construction algorithm aborted or did not create an output file.

Theorems & Definitions (13)

  • Theorem
  • Proposition
  • Proof
  • Proposition
  • Proof
  • Proposition
  • Proof
  • Theorem
  • Proof
  • Theorem
  • ...and 3 more