Optimal-Time Mapping in Run-Length Compressed PBWT

Paola Bonizzoni, Davide Cozzi, Younan Gao

TL;DR

This work addresses the memory efficiency and speed of mapping in the run-length encoded PBWT for multi-allelic haplotype panels. It introduces an $O(\tilde{r})$-word data structure that supports constant-time forward and backward stepping by partitioning runs into sub-runs and organizing them into SubIB and SubIF representations. The authors then show how to perform prefix search and haplotype retrieval within this compressed framework: prefix queries take $O(m' \log\log_w \sigma + \mathrm{occ})$ time using $O(h + \tilde{r})$ words, and haplotype retrieval takes $O(\log\log_w h + w)$ time using $O(\tilde{r})$ words. These results give optimal-time mapping within the $\mu$-PBWT, a building block for scalable analysis of large haplotype panels and for related PBWT-based tools in genomics.

Abstract

The Positional Burrows-Wheeler Transform (PBWT) is a data structure designed for efficiently representing and querying large collections of sequences, such as haplotype panels in genomics. Forward and backward stepping operations, the analogues of LF- and FL-mapping in the traditional BWT, are fundamental to the PBWT and underpin many PBWT-based algorithms for haplotype matching and related analyses. Although the run-length encoded variant of the PBWT (also known as the $\mu$-PBWT) achieves $O(\tilde{r})$-word space usage, where $\tilde{r}$ is the total number of runs, no data structure supporting both forward and backward stepping in constant time within this space bound was previously known. In this paper, we consider the multi-allelic PBWT, which extends the original binary form to a general ordered alphabet $\{0, \dots, \sigma-1\}$. We first establish bounds on the size $\tilde{r}$ and then introduce a new $O(\tilde{r})$-word data structure built over a list of haplotypes $\{S_1, \dots, S_h\}$, each of length $w$, that supports constant-time forward and backward stepping. We further revisit two key applications, haplotype retrieval and prefix search, leveraging our efficient forward stepping technique. Specifically, we design an $O(\tilde{r})$-word data structure that supports haplotype retrieval in $O(\log\log_{w} h + w)$ time. For prefix search, we present an $O(h + \tilde{r})$-word data structure that answers queries in $O(m' \log\log_{w} \sigma + \mathrm{occ})$ time, where $m'$ denotes the length of the longest common prefix returned and $\mathrm{occ}$ denotes the number of haplotypes prefixed by that longest common prefix.

Paper Structure

This paper contains 21 sections, 15 theorems, and 6 figures.

Key Result

Lemma 1

[BelazzouguiN15] Given a list of $n'$ integers sorted in increasing order and drawn from the universe $\{0, \dots, \sigma-1\}$, there is a data structure that occupies $O(n' \log \sigma)$ bits of space and answers a predecessor query in $O(\log \log_{w} \sigma)$ time.
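
To make the query in Lemma 1 concrete, here is a minimal Python sketch of predecessor semantics over a sorted list. It uses plain binary search from the standard `bisect` module, which takes $O(\log n')$ time; it is only a stand-in and does not achieve the $O(\log \log_{w} \sigma)$ bound of the lemma. The function name `predecessor`, the sample list, and the convention of returning the largest element less than or equal to the query are assumptions made for illustration.

```python
from bisect import bisect_right


def predecessor(sorted_values, x):
    """Return the largest element <= x, or None if every element exceeds x.

    Assumption: sorted_values is sorted in increasing order.
    This is binary search, not the structure cited in Lemma 1.
    """
    i = bisect_right(sorted_values, x)  # number of elements <= x
    return sorted_values[i - 1] if i > 0 else None


# Hypothetical sample data: a sorted list of interval start positions.
starts = [1, 3, 4, 6, 8, 10, 11, 14, 15]
print(predecessor(starts, 9))   # 8
print(predecessor(starts, 15))  # 15
print(predecessor(starts, 0))   # None
```

With interval start positions as the sorted list, such a query identifies the interval that contains a given position.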

Figures (6)

  • Figure 1: Illustration of the algorithm scheme for building sub-runs in $\texttt{SubIB}_j$ and an example. In this example, $\texttt{SubIB}_{j-1} = \{\,[1,2], [3,3], [4,5], [6,6], [7,8], [9,11], [12,13], [14,14], [15,16]\,\}$ and $\texttt{intervals}_j = \{\,[1,1], [2,11], [12,16]\,\}$. The bijective function $\texttt{foreL}_{j-1}$ maps the list $\texttt{SubIB}_{j-1}$ to the list $\{\,[1,2], [3,3], [4,5], [6,7], [8,9], [10,10], [11,13], [14,14], [15,16]\,\}$. Intervals highlighted in the same color contain the same haplotype indices and indicate corresponding pairs under this bijection. After applying the normalization algorithm to $\texttt{intervals}_j$ and $\texttt{foreL}_{j-1}(\texttt{SubIB}_{j-1})$, the intervals in $\texttt{intervals}_j$ are partitioned into $\texttt{SubIB}_j = \{\,[1,1], [2,5], [6,10], [11,11], [12,16]\,\}$. Each interval in $\texttt{SubIB}_j$ overlaps with at most three intervals in $\texttt{foreL}_{j-1}(\texttt{SubIB}_{j-1})$.
  • Figure 3: Example of $\texttt{PBWT}$ and $\texttt{PA}$ built for bi-allelic haplotypes $\{S_1, S_2, S_3, S_4, S_5\}$. The operation $\texttt{fore}[5][4]$ returns $2$, and position $2$ lies in the first run of the fifth column of the PBWT.
  • Figure 4: Example of the normalization algorithm. Given $I_p = \{[1,1], [2,11], [12,16]\}$ and $I_q = \{[1,2], [3,3], [4,5], [6,7], [8,9], [10,10], [11,13], [14,14], [15,16]\}$, the algorithm outputs $\hat{I}_p = \{[1,1], [2,5], [6,10], [11,11], [12,16]\}$, where each interval in $\hat{I}_p$ overlaps at most three intervals in $I_q$.
  • Figure 5: Illustration of the algorithm scheme for building sub-runs in $\texttt{SubIF}_j$ and an example. In this example, $\texttt{SubIF}_{j+1} = \{\,[1,2], [3,3], [4,5], [6,7], [8,9], [10,10], [11,13], [14,14], [15,16]\,\}$ and $\texttt{intervals}_j = \{\,[1,5], [6,6], [7,16]\,\}$. The bijective function $\texttt{foreL}_{j}$ maps the list $\texttt{intervals}_{j}$ to the list $\{\,[1,1], [2,11], [12,16]\,\}$. Intervals highlighted in the same color contain the same haplotype indices and indicate corresponding pairs under this bijection. After applying the normalization algorithm to $\texttt{foreL}_j(\texttt{intervals}_j)$ and $\texttt{SubIF}_{j+1}$, the intervals in $\texttt{foreL}_j(\texttt{intervals}_j)$ are partitioned into $\texttt{SubIF}'_{j+1} = \{\,[1,1], [2,5], [6,10], [11,11], [12,16]\,\}$. Each interval in $\texttt{SubIF}'_{j+1}$ overlaps with at most three intervals in $\texttt{SubIF}_{j+1}$. Finally, by setting $\texttt{SubIF}_j = \texttt{backL}_{j+1}(\texttt{SubIF}'_{j+1})$, the intervals $\texttt{intervals}_j$ are partitioned into $\texttt{SubIF}_j = \{\,[1,5], [6,6], [7,10], [11,15], [16,16]\,\}$.
  • Figure 6: Example illustrating the data structure and algorithm for $\texttt{fore}$ queries. The fourth interval $[11,15]$ in $\texttt{SubIF}_j$ corresponds to the third interval $[6,10]$ in $\texttt{foreL}_j(\texttt{SubIF}_j)$, which overlaps three intervals in $\texttt{SubIF}_{j+1}$: the fourth $[6,7]$, the fifth $[8,9]$, and the sixth $[10,10]$. Accordingly, the data structure $F^4_j$ built for the interval $[11,15]$ stores the quintuples $\{(11,6,6,7,4), (11,6,8,9,5), (11,6,10,10,6)\}$ in the form $(s', \tilde{s}, s, t, \lambda)$. Given a query $\texttt{fore}[14][j]$ with index $4$, the algorithm finds the tuple $(s'=11, \tilde{s}=6, s=8, t=9, \lambda=5)$ in $F^4_j$ (since $14 - s' + \tilde{s} = 14 - 11 + 6 = 9 \in [8,9]$) and returns $14 - s' + \tilde{s} = 9$ and $\lambda = 5$ as the answer.
  • ...and 1 more figure
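
The interval examples in the captions of Figures 4 and 6 can be checked with a short Python sketch. `normalize` is a simple greedy split, assumed here for illustration and not necessarily the authors' normalization algorithm: it cuts an interval of $I_p$ just before it would overlap a fourth interval of $I_q$, which reproduces the output listed for Figure 4. `fore_lookup` scans quintuples of the form $(s', \tilde{s}, s, t, \lambda)$ as described for Figure 6; a linear scan is used only because each sub-run stores at most three quintuples. Both function names, the variable names, and the assumption that $I_q$ is sorted and covers the same range as $I_p$ are hypothetical.

```python
def normalize(i_p, i_q):
    """Greedily split i_p so each piece overlaps at most three intervals of i_q.

    Intervals are closed (start, end) pairs. Assumption: both lists are
    sorted and partition the same range. Illustrative greedy rule only.
    """
    out = []
    for start, end in i_p:
        cur = start
        while cur <= end:
            # Starts of i_q intervals that begin after cur but inside [cur, end].
            later_starts = [qs for qs, _ in i_q if cur < qs <= end]
            # The piece beginning at cur overlaps the i_q interval containing
            # cur plus one more per later start it includes. Keep two later
            # starts and cut just before the third.
            cut = later_starts[2] - 1 if len(later_starts) >= 3 else end
            out.append((cur, cut))
            cur = cut + 1
    return out


def fore_lookup(quintuples, i):
    """Map position i using one sub-run's quintuples (s', s~, s, t, lam)."""
    for s_prime, s_tilde, s, t, lam in quintuples:
        mapped = i - s_prime + s_tilde  # shift i by the sub-run's offset
        if s <= mapped <= t:
            return mapped, lam
    raise ValueError("position is not covered by this sub-run")


# Data copied from the Figure 4 caption.
i_p = [(1, 1), (2, 11), (12, 16)]
i_q = [(1, 2), (3, 3), (4, 5), (6, 7), (8, 9), (10, 10), (11, 13), (14, 14), (15, 16)]
print(normalize(i_p, i_q))  # [(1, 1), (2, 5), (6, 10), (11, 11), (12, 16)]

# Quintuples copied from the Figure 6 caption (structure F^4_j).
f4 = [(11, 6, 6, 7, 4), (11, 6, 8, 9, 5), (11, 6, 10, 10, 6)]
print(fore_lookup(f4, 14))  # (9, 5)
```

The second call mirrors the query $\texttt{fore}[14][j]$ from Figure 6: position 14 is shifted to $14 - 11 + 6 = 9$, which falls in $[8,9]$, so the sketch returns position 9 together with $\lambda = 5$.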

Theorems & Definitions (16)

  • Lemma 1
  • Lemma 2
  • Proposition 3
  • Lemma 4
  • Corollary 5
  • Lemma 6
  • Lemma 7
  • Theorem 8
  • Definition 9: The Three-Overlap Constraint
  • Lemma 10
  • ...and 6 more