Bounding the Average Move Structure Query for Faster and Smaller RLBWT Permutations

Nathaniel K. Brown, Ben Langmead

TL;DR

This work addresses move-structure queries in RLBWT-based compressed indexes by introducing length capping, a simple interval-splitting technique that bounds average-case query time to the optimal level while improving construction time. The authors prove that capping intervals at $L = c \cdot \frac{n}{r}$ yields at most $r' \le r + \frac{r}{c}$ intervals and $O(1)$ amortized move queries over a single cycle, with a space bound of $O(r \log r + r \log \frac{n}{r})$ bits and worst-case $O(\log \frac{n}{r})$ time. They show that length-capped move structures enable optimal-time BWT inversion and SA/DA enumeration in $O(n)$ time using $O(r)$ extra space, and provide the RunPerm library to evaluate these ideas in practice. Experiments on large genomic collections demonstrate substantial space reductions (e.g., ~40-46% for LF) and faster average queries, particularly when length-capping is combined with balancing, indicating strong practical impact for pangenome-scale indexes.
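
The interval bound can be checked with a short counting argument: an interval of length $\ell$ splits into $\lceil \ell / L \rceil < 1 + \ell / L$ pieces, so summing over all $r$ intervals gives $r' \le r + \frac{n}{L} = r + \frac{r}{c}$. The Python sketch below illustrates that splitting on a toy list of interval lengths. It is only an illustration under stated assumptions: the function name `cap_lengths`, the choice to round $L$ up to an integer, and the example lengths are hypothetical and are not taken from the paper or from RunPerm.

```python
from math import ceil

def cap_lengths(lengths, c=1.0):
    """Split every interval into pieces of length at most L = ceil(c * n / r).

    `lengths` holds the lengths of the r contiguously permuted intervals.
    Rounding L up to an integer is an assumption made for this sketch.
    """
    n = sum(lengths)
    r = len(lengths)
    cap = max(1, ceil(c * n / r))
    capped = []
    for length in lengths:
        full, rest = divmod(length, cap)
        capped.extend([cap] * full)   # full-size pieces
        if rest:
            capped.append(rest)       # shorter trailing piece, if any
    return cap, capped

# Toy input (hypothetical): n = 20, r = 5, one long interval of length 14.
lengths = [1, 1, 14, 2, 2]
cap, capped = cap_lengths(lengths, c=1.0)

r = len(lengths)
r_prime = len(capped)
print(cap, capped, r_prime)
```

With $c = 1$ the long interval of length 14 is cut into pieces of length at most 4, and the resulting interval count stays within $r + r/c = 10$. Larger values of $c$ split less aggressively, trading a longer maximum interval for fewer added intervals.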

Abstract

The move structure represents permutations with long contiguously permuted intervals in compressed space with optimal query time. Move structures have become an important feature of compressed text indexes that use space proportional to the number of Burrows-Wheeler Transform (BWT) runs, which are often applied in genomics. This is thanks not only to theoretical improvements over past approaches, but also to great cache efficiency and average-case query time in practice, even without the worst-case guarantees provided by the interval-splitting balancing of the original result. In this paper, we show that an even simpler type of splitting, length capping by truncating long intervals, bounds the average move structure query time to optimal while achieving a better construction time than the traditional approach. This also proves constant query time when amortized over a full traversal of a single-cycle permutation from an arbitrary starting position. Such a scheme has surprising benefits both in theory and in practice. We leverage the approach to improve the representation of any move structure with $r$ runs over a domain of size $n$ to $O(r \log r + r \log \frac{n}{r})$ bits of space. The worst-case query time is also improved to $O(\log \frac{n}{r})$ without balancing. An $O(r)$-time and $O(r)$-space construction lets us apply the method to run-length encoded BWT (RLBWT) permutations such as LF and $φ$ to obtain optimal-time algorithms for BWT inversion and suffix array (SA) enumeration in $O(r)$ additional working space. Finally, we provide the RunPerm library, which offers flexible plug-and-play move structure support, and use it to evaluate our splitting approach. Experiments find that length capping results in not only faster move structures but also a space reduction: at least $\sim 40\%$ for LF across large repetitive genomic collections.
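
To make the role of LF in BWT inversion concrete, the following Python sketch builds the LF permutation for a small text and inverts the BWT by following LF from row 0. It uses plain arrays and a naive rotation sort, so it runs in far more than $O(r)$ space; it is not the move-structure algorithm from the paper, and the function names are hypothetical. It also counts the contiguously permuted runs of LF, which never exceed the number of BWT runs because LF maps each run of equal characters to consecutive positions.

```python
def bwt(text):
    """Naive BWT via sorted rotations; text must end with a unique smallest '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def lf_mapping(b):
    """LF(i) = (number of characters smaller than b[i]) + (rank of b[i] within b[:i])."""
    counts = {}
    for ch in b:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    seen, lf = {}, []
    for ch in b:
        lf.append(first[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    return lf

def invert(b):
    """Recover the text by following LF for len(b) steps, starting at row 0."""
    lf = lf_mapping(b)
    i, out = 0, []
    for _ in range(len(b)):
        out.append(b[i])
        i = lf[i]
    # out holds the text before '$' in reverse order, followed by '$'.
    return "".join(reversed(out[:-1])) + out[-1]

text = "banana$"
b = bwt(text)
lf = lf_mapping(b)
bwt_runs = sum(1 for i in range(len(b)) if i == 0 or b[i - 1] != b[i])
lf_runs = sum(1 for i in range(len(lf)) if i == 0 or lf[i - 1] != lf[i] - 1)
restored = invert(b)
print(b, lf, bwt_runs, lf_runs, restored)
```

A move structure replaces the explicit `lf` array with one entry per run, which is where the space proportional to $r$ comes from.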

Paper Structure (18 sections, 11 theorems, 2 equations, 2 figures, 3 tables)

This paper contains 18 sections, 11 theorems, 2 equations, 2 figures, 3 tables.

Key Result

Theorem 1

Given a permutation $\pi$ over $[0,n-1]$, let $S$ be the set of $r$ positions $i$ such that either $i=0$ or $\pi(i-1) \neq \pi(i)-1$. We can construct in $O(r \log r)$-time and $O(r)$-space [bertram2024move] a balanced move structure, which, given position $i$ and the rank of its predecessor in $S$, compu…
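
As an illustration of the objects in this statement, the Python sketch below computes $S$ for a toy single-cycle permutation, stores one table row per run, and answers move queries that take a position together with the index of its interval. It is an unbalanced, uncapped sketch under assumed names (`build_move_structure`, `move`); it is not the construction cited in the theorem and not the RunPerm API. The inner loop is the "fast-forward" scan whose length balancing or length capping is meant to bound.

```python
from bisect import bisect_right

def build_move_structure(pi):
    """Return run starts S, their images pi(p), and the interval index of each image."""
    n = len(pi)
    starts = [i for i in range(n) if i == 0 or pi[i - 1] != pi[i] - 1]
    images = [pi[p] for p in starts]
    dest = [bisect_right(starts, q) - 1 for q in images]
    return starts, images, dest

def move(table, i, j):
    """Map position i in interval j to (pi(i), index of its interval, scan steps)."""
    starts, images, dest = table
    i2 = images[j] + (i - starts[j])   # offset within the run is preserved
    j2 = dest[j]
    steps = 0
    # Fast-forward until interval j2 contains i2.
    while j2 + 1 < len(starts) and starts[j2 + 1] <= i2:
        j2 += 1
        steps += 1
    return i2, j2, steps

# Toy single-cycle permutation (hypothetical): rotation by 3 over n = 8, r = 2 runs.
pi = [3, 4, 5, 6, 7, 0, 1, 2]
table = build_move_structure(pi)

i, j = 0, 0
visited, total_steps = [], 0
for _ in range(len(pi)):
    visited.append(i)
    i, j, steps = move(table, i, j)
    total_steps += steps

print(table[0], visited, total_steps)
```

Traversing the full cycle visits every position exactly once and returns to the start; `total_steps` counts the fast-forward work summed over that traversal, which is the quantity an amortized bound speaks about.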

Figures (2)

  • Figure 1: For an example permutation $\pi$, shows the corresponding relationship between its contiguously permuted runs and the move structure components $S$, $S_\pi$, and $S_{rank(\pi)}$.
  • Figure 2: Left: The time per move query in nanoseconds across collections of chr19 haplotypes. Right: The respective disk size in megabytes for the corresponding move structures using RunPerm.

Theorems & Definitions (11)

  • Theorem 1: Nishimoto and Tabei [nishimoto2021optimal]
  • Theorem 2
  • Corollary 3
  • Theorem 4
  • Corollary 5
  • Lemma 6
  • Lemma 7
  • Lemma 8
  • Theorem 9
  • Theorem 10
  • ...and 1 more