Minimizers in Semi-Dynamic Strings

Wiktor Zuba, Oded Lachish, Solon P. Pissis

TL;DR

This work addresses efficient computation of minimizers under semi-dynamic string updates, a common need in sequence analysis where a window slides over a string or a string is extended at its ends. It introduces a semi-dynamic string model, in which a letter may be inserted or deleted at either end, together with a minimizer data structure that supports $\mathcal{O}(1)$-time minimizer queries and amortized $\mathcal{O}(1)$-time updates. It also presents a space-efficient variant that uses $\mathcal{O}(\sqrt{w})$ working space with the same asymptotic time bounds, enabling $\mathcal{O}(n)$-time computation of $\mathcal{M}_{w,k,\rho}(S)$ in working space sublinear in $w$. The paper further develops a two-layer (and general multi-layer) framework that bounds the stored information and the cost of rebuilds, with accompanying theoretical guarantees. An application to minimizers on a weighted trie demonstrates the approach's usefulness in reducing space and time in realistic genomic settings, with experimental results favoring the proposed structures over traditional $\mathcal{O}(w)$-space sliding-window methods.

Abstract

Minimizers sampling is one of the most widely used mechanisms for sampling strings. Let $S=S[0]\ldots S[n-1]$ be a string over an alphabet $\Sigma$. In addition, let $w\geq 2$ and $k\geq 1$ be two integers and $\rho=(\Sigma^k,\leq)$ be a total order on $\Sigma^k$. The minimizer of window $X=S[i\mathinner{.\,.} i+w+k-2]$ is the smallest position in $[i,i+w-1]$ where the smallest length-$k$ substring of $S[i\mathinner{.\,.} i+w+k-2]$ based on $\rho$ starts. The set of minimizers for all $i\in[0,n-w-k+1]$ is the set $\mathcal{M}_{w,k,\rho}(S)$ of the minimizers of $S$. The set $\mathcal{M}_{w,k,\rho}(S)$ can be computed in $\mathcal{O}(n)$ time. The folklore algorithm for this computation computes the minimizer of every window in $\mathcal{O}(1)$ amortized time using $\mathcal{O}(w)$ working space. It is thus natural to pose the following two questions: Question 1: Can we efficiently support other dynamic updates on the window? Question 2: Can we improve on the $\mathcal{O}(w)$ working space? We answer both questions in the affirmative: 1. We term a string $X$ semi-dynamic when one is allowed to insert or delete a letter at any of its ends. We show a data structure that maintains a semi-dynamic string $X$ and supports minimizer queries in $X$ in $\mathcal{O}(1)$ time with amortized $\mathcal{O}(1)$ time per update operation. 2. We show that this data structure can be modified to occupy strongly sublinear space without increasing the asymptotic complexity of its operations. To the best of our knowledge, this yields the first algorithm for computing $\mathcal{M}_{w,k,\rho}(S)$ in $\mathcal{O}(n)$ time using $\mathcal{O}(\sqrt{w})$ working space. We complement our theoretical results with a concrete application and an experimental evaluation.
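
To make the definition in the abstract concrete, the following is a minimal brute-force sketch that computes $\mathcal{M}_{w,k,\rho}(S)$ directly from the definition. It assumes that $\rho$ is the lexicographic order on length-$k$ substrings (the paper allows any total order), and the function name `minimizers_bruteforce` is hypothetical. It is meant only as a reference for checking faster methods, since it spends $\mathcal{O}(wk)$ time per window.

```python
def minimizers_bruteforce(s, w, k):
    """Return the set of minimizer positions of s, straight from the definition.

    Assumption: the order rho is plain lexicographic order on k-mers.
    """
    if w < 2 or k < 1:
        raise ValueError("expected w >= 2 and k >= 1")
    result = set()
    # Windows start at i = 0 .. n-w-k+1 and span s[i .. i+w+k-2].
    for i in range(len(s) - w - k + 2):
        # Candidate k-mer start positions in this window are i .. i+w-1.
        # Taking min over (k-mer, position) pairs selects the smallest k-mer
        # and, among equal k-mers, the smallest position.
        _, pos = min((s[j:j + k], j) for j in range(i, i + w))
        result.add(pos)
    return result
```

For example, `minimizers_bruteforce("ACGTACGT", 3, 2)` returns `{0, 1, 4}`: the first window selects `AC` at position 0, the second selects `CG` at position 1, and the remaining three windows all select the second occurrence of `AC` at position 4.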

Paper Structure

This paper contains 26 sections, 7 theorems, 7 figures.

Key Result

Proposition 2

For any string $S$ of length $n$ over an alphabet $\Sigma$, two integers $w\ge 2$ and $k\ge 1$, and an order $\rho=(\Sigma^k,\leq)$, $\mathcal{M}_{w,k,\rho}(S)$ can be computed in $\mathcal{O}(n)$ time.
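
A linear-time computation of this kind is commonly realized with a monotone double-ended queue, which is the sliding-window idea the abstract calls the folklore algorithm. The sketch below is an illustration under assumptions, not the paper's code: it takes $\rho$ to be lexicographic order, the name `minimizers_sliding` is hypothetical, and it compares $k$-mers by slicing, which costs $\mathcal{O}(k)$ per comparison. The $\mathcal{O}(n)$ bound presumes that two $k$-mers can be compared in constant time (for instance via precomputed ranks).

```python
from collections import deque


def minimizers_sliding(s, w, k):
    """Sliding-window minimizers with a monotone deque (O(w) working space).

    Assumption: rho is lexicographic order; k-mers are compared by slicing.
    """
    if w < 2 or k < 1:
        raise ValueError("expected w >= 2 and k >= 1")
    result = set()
    # Positions whose k-mers are non-decreasing from front to back.
    candidates = deque()
    for j in range(len(s) - k + 1):
        kmer = s[j:j + k]
        # A position with a strictly larger k-mer than the new one can never
        # be a minimizer again: the new, smaller k-mer leaves the window later.
        while candidates and s[candidates[-1]:candidates[-1] + k] > kmer:
            candidates.pop()
        candidates.append(j)
        # The window ending at k-mer position j starts at position j - w + 1;
        # at most one candidate (the front) can have slid out of it.
        if candidates[0] < j - w + 1:
            candidates.popleft()
        # Once a full window has been read, the front is its minimizer.
        if j >= w - 1:
            result.add(candidates[0])
    return result
```

Equal $k$-mers are kept in the deque (the pop condition is strict), so the front is always the leftmost occurrence of the smallest value, as the definition of a minimizer requires. Each position is appended and removed at most once, which gives the amortized constant number of deque operations per window.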

Figures (7)

  • Figure 1: Window $X$ (slid over string $S$) with the values for each length-$3$ fragment (for legibility matching their lexicographic order) starting at each position of $X$. Values for which there exists a smaller value to the right are crossed out -- they cannot become a minimizer, as this smaller value will leave the window later. The remaining values form a non-decreasing sequence; notice that, in particular, the last value cannot be crossed out.
  • Figure 2: A graphical representation of the data structure. String $X$ with the values for each length-$3$ fragment (for simplicity matching their lexicographic order) starting at each position of $X$. Position $x$ drawn in red divides $X$ into two parts -- for each one, the "interesting" position-values are drawn in blue; the ones in gray are not represented but they may be computed again during the full structure rebuild. One can notice that the distinguished values can repeat in the left part but not in the right one. On the right, the "interesting" $(\mathit{position},\mathit{value})$ pairs are stored in two stacks. Note that, in particular, the positions $x$ and $x-1$ are always represented (the stacks are nonempty).
  • Figure 3: A graphical representation of the two-layer data structure for $c=5, x=0$. The "internal" blocks (with darker colour on the figure) are represented by one fragment each (the one with the smallest value). We store one second-layer stack representing the internal blocks on each side, and one first-layer stack for each border block. Note that in the process of updating the structure, there are either one or two such blocks for each side of the string.
  • Figure 4: Illustration of the problem of computing minimizers for all the length-$\ell$ paths of a trie (consisting of a heavy path and small subtrees hanging off it), starting from nodes and going towards the root, exactly as considered in DBLP:conf/icde/Gabory0LPZ24. For $\ell=5$ and $k=2$, the smallest (in lexicographic order) length-$2$ fragment of $\texttt{GCACT}$ (the length-$5$ string spelled on the blue path starting at the red node) is $\texttt{AC}$, hence the green node is reported as part of the output of the algorithm from DBLP:conf/icde/Gabory0LPZ24.
  • Figure 5: The results on the EFM dataset with $z=32$.
  • ...and 2 more figures
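
The two-stack picture in the caption of Figure 2 resembles the classical way of building a double-ended queue with minimum queries out of two stacks. The sketch below illustrates that classical technique and is not the paper's data structure: it stores every element together with a running minimum (so it uses linear space, whereas the caption says that uninteresting values are not represented), it abstracts each position to a single comparable value such as a $k$-mer or its rank, and the class name `MinDeque` and its methods are hypothetical.

```python
class MinDeque:
    """Deque of comparable values with leftmost-minimum queries.

    Each stack entry is (value, index, best_value, best_index), where
    (best_value, best_index) is the leftmost minimum among all entries
    from the bottom of that stack up to and including this entry.
    """

    def __init__(self):
        self.left = []   # top of this stack is the leftmost element
        self.right = []  # top of this stack is the rightmost element
        self.lo = 0      # index of the leftmost element
        self.hi = 0      # one past the index of the rightmost element

    @staticmethod
    def _push(stack, value, index, new_wins_ties):
        best = (value, index)
        if stack:
            bv, bi = stack[-1][2], stack[-1][3]
            if not (value < bv or (new_wins_ties and value == bv)):
                best = (bv, bi)
        stack.append((value, index) + best)

    def push_front(self, value):
        self.lo -= 1
        # A new leftmost element wins ties, so equal values may repeat here.
        self._push(self.left, value, self.lo, True)

    def push_back(self, value):
        # A new rightmost element loses ties, so only strict minima change.
        self._push(self.right, value, self.hi, False)
        self.hi += 1

    def _rebuild(self, mid):
        # Redistribute all elements: the first `mid` go to the left stack.
        items = [e[:2] for e in reversed(self.left)] + [e[:2] for e in self.right]
        self.left, self.right = [], []
        for value, index in reversed(items[:mid]):
            self._push(self.left, value, index, True)
        for value, index in items[mid:]:
            self._push(self.right, value, index, False)

    def pop_front(self):
        if not self.left:
            if not self.right:
                raise IndexError("pop from empty MinDeque")
            self._rebuild((len(self.right) + 1) // 2)
        self.lo += 1
        return self.left.pop()[0]

    def pop_back(self):
        if not self.right:
            if not self.left:
                raise IndexError("pop from empty MinDeque")
            self._rebuild(len(self.left) // 2)
        self.hi -= 1
        return self.right.pop()[0]

    def argmin(self):
        """Return the index of the leftmost minimum value."""
        tops = [s[-1][2:] for s in (self.left, self.right) if s]
        if not tops:
            raise IndexError("argmin of empty MinDeque")
        # Tuples compare by value first, then by index (leftmost on ties).
        return min(tops)[1]
```

Queries inspect only the two stack tops, so they take constant time. A rebuild happens only when one stack runs empty and moves about half of the elements across, which is the usual argument for amortized constant time per update in this classical construction. The tie-breaking rule mirrors the observation in the caption that distinguished values can repeat in the left part but not in the right one.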

Theorems & Definitions (8)

  • Definition 1: Minimizers
  • Proposition 2: e.g., see DBLP:journals/tkde/LoukidesPS23
  • Lemma 4
  • Lemma 5
  • Theorem 7
  • Lemma 8
  • Theorem 9
  • Corollary 10