Space-Efficient Indexes for Uncertain Strings

Esteban Gabory, Chang Liu, Grigorios Loukides, Solon P. Pissis, Wiktor Zuba

TL;DR

An index of $\mathcal{O}(\frac{nz}{\ell}\log z)$ expected size is proposed, which can be constructed using $\mathcal{O}(\frac{nz}{\ell}\log z)$ expected space, and supports very fast pattern matching queries in expectation, for patterns of length $m\geq \ell$.

Abstract

Strings in the real world are often encoded with some level of uncertainty. In the character-level uncertainty model, an uncertain string $X$ of length $n$ on an alphabet $Σ$ is a sequence of $n$ probability distributions over $Σ$. Given an uncertain string $X$ and a weight threshold $\frac{1}{z}\in(0,1]$, we say that pattern $P$ occurs in $X$ at position $i$, if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+|P|-1$ is at least $\frac{1}{z}$. While indexing standard strings for online pattern searches can be performed in linear time and space, indexing uncertain strings is much more challenging. Specifically, the state-of-the-art index for uncertain strings has $\mathcal{O}(nz)$ size, requires $\mathcal{O}(nz)$ time and $\mathcal{O}(nz)$ space to be constructed, and answers pattern matching queries in the optimal $\mathcal{O}(m+|\text{Occ}|)$ time, where $m$ is the length of $P$ and $|\text{Occ}|$ is the total number of occurrences of $P$ in $X$. For large $n$ and (moderate) $z$ values, this index is completely impractical to construct, which outweighs the benefit of the supported optimal pattern matching queries. We were thus motivated to design a space-efficient index at the expense of slower yet competitive pattern matching queries. We propose an index of $\mathcal{O}(\frac{nz}{\ell}\log z)$ expected size, which can be constructed using $\mathcal{O}(\frac{nz}{\ell}\log z)$ expected space, and supports very fast pattern matching queries in expectation, for patterns of length $m\geq \ell$. We have implemented and evaluated several versions of our index. The best-performing version of our index is up to two orders of magnitude smaller than the state of the art in terms of both index size and construction space, while offering faster or very competitive query and construction times.
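To make the occurrence definition concrete, the following is a minimal brute-force sketch in Python. It is not the index proposed in the paper: it scans every position and multiplies the letter probabilities directly. The function name `z_valid_occurrences`, the representation of $X$ as a list of dictionaries, and the toy input are assumptions made for illustration only.

```python
from fractions import Fraction
from math import prod


def z_valid_occurrences(X, P, z):
    """Return all 0-based positions i where P occurs in X with probability >= 1/z.

    X is a list with one dict per position, mapping letters to probabilities.
    Letters missing from a dict are treated as having probability 0.
    """
    n, m = len(X), len(P)
    threshold = Fraction(1, z)
    hits = []
    for i in range(n - m + 1):
        # Product of the probabilities of P's letters at positions i..i+m-1.
        p = prod(X[i + j].get(P[j], Fraction(0)) for j in range(m))
        if p >= threshold:
            hits.append(i)
    return hits


# Hypothetical toy uncertain string of length 3 over the alphabet {A, B}.
X = [
    {"A": Fraction(1)},
    {"A": Fraction(1, 2), "B": Fraction(1, 2)},
    {"A": Fraction(3, 4), "B": Fraction(1, 4)},
]

print(z_valid_occurrences(X, "AA", 4))  # positions with probability >= 1/4
print(z_valid_occurrences(X, "AB", 4))
```

Exact fractions are used so that comparisons against $\frac{1}{z}$ are not affected by floating-point rounding. This scan takes time proportional to $nm$ per query, which is the cost an index is meant to avoid.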

Paper Structure (15 sections, 12 theorems, 1 equation, 12 figures, 2 tables, 4 algorithms)

This paper contains 15 sections, 12 theorems, 1 equation, 12 figures, 2 tables, 4 algorithms.

Key Result

Lemma 1

The density of an $(\ell,k)$-minimizer scheme on alphabet $\Sigma$ with $k\ge \log_{|\Sigma|} \ell+c$ is $\mathcal{O}(\frac{1}{\ell})$, for some $c=\mathcal{O}(1)$.
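As an illustration of what the lemma measures, here is a small Python sketch of an $(\ell,k)$-minimizer scheme on a standard string. It assumes the common definition in which every window of length $\ell$ selects the start position of its smallest length-$k$ substring, with ties broken to the left, and it assumes plain lexicographic order on $k$-mers. The function name `minimizers` is hypothetical, and the order used in the paper may differ.

```python
def minimizers(S, ell, k):
    """Sorted start positions selected by an (ell, k)-minimizer scheme on S.

    Assumes lexicographic order on k-mers and leftmost tie-breaking.
    """
    if not (1 <= k <= ell <= len(S)):
        return []
    selected = set()
    for start in range(len(S) - ell + 1):
        # Start positions of the k-mers fully contained in S[start:start+ell].
        candidates = range(start, start + ell - k + 1)
        # Comparing (k-mer, position) pairs picks the smallest k-mer,
        # and the leftmost one when several k-mers are equal.
        selected.add(min(candidates, key=lambda i: (S[i:i + k], i)))
    return sorted(selected)


S = "CAGAGA"
positions = minimizers(S, ell=4, k=2)
print(positions)                # selected positions
print(len(positions) / len(S))  # fraction of positions selected
```

The density of the scheme is the fraction of positions selected. The lemma states that this fraction is $\mathcal{O}(\frac{1}{\ell})$ once $k$ is at least $\log_{|\Sigma|} \ell+c$; a toy string this short only shows the mechanics, not the asymptotic behaviour.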

Figures (12)

  • Figure 1: An informal overview of our techniques: Given a weighted string $X$ of length $n$ over an alphabet of size $\sigma$, a weight threshold $\frac{1}{z}$, and an integer $\ell$, we (A) use the algorithm from [DBLP:conf/cpm/BartonKPR16] to construct a family $\mathcal{S}$ of $z$ standard strings, each of length $n$. (B) For each such string, we consider all of its $n$ suffixes and sample them for the given integer $\ell$ using the minimizers mechanism [DBLP:journals/bioinformatics/RobertsHHMY04, DBLP:conf/sigmod/SchleimerWA03]. These suffixes imply a set of $\mathcal{O}(n/\ell)$ suffixes and a set of $\mathcal{O}(n/\ell)$ reversed suffixes in expectation. (C) We then index these suffixes in two suffix trees [DBLP:conf/focs/Weiner73], which we link using a 2D grid, so as to answer pattern matching queries for patterns of length at least $\ell$. (D) When such a pattern $P$ of length $m$ is given, we find its leftmost minimizer, which implies a suffix and a reversed suffix of $P$, and query those using our index. Our index efficiently merges the partial results (i.e., occurrences of suffixes and reversed suffixes); and (E) outputs all $z$-valid occurrences of $P$ in $X$. The size of the resulting index is $\mathcal{O}(\frac{nz}{\ell}\log z)$. The extra multiplicative $\log z$ factor comes from our representation of the edge labels in the suffix trees.
  • Figure 1: A $4$-estimation $\mathcal{S}$ of $X$ from Example \ref{ex:weighted_string}.
  • Figure 2: Suffix tree of $S=\texttt{CAGAGA\$}$.
  • Figure 3: Our minimizer-based index for the weighted string from Example \ref{ex:weighted_string}, $\frac{1}{z}=\frac{1}{4}$, and the minimizers from Example \ref{ex:2d-structure}. $\mathcal{T}_{\text{suff}}$ is the forward minimizer solid factor tree and $\mathcal{T}_{\text{pref}}$ is the backward one. Edges without labels are constructed for readability and mean that the parent and the children nodes correspond to the same string. Each leaf node representing the minimizer position $i$ in string $S_j$ is decorated with $(i,j)$.
  • Figure 4: Construct-$\mathcal{T}$
  • ...and 7 more figures
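
Step (D) of the overview in Figure 1 splits a query pattern at its leftmost minimizer. The following Python sketch shows one way such a split could look. It assumes lexicographic order on $k$-mers with leftmost tie-breaking, and it assumes that the forward and reversed parts share the letter at the minimizer position; this summary does not state the paper's exact convention, so both choices, as well as the name `split_at_leftmost_minimizer`, are hypothetical.

```python
def split_at_leftmost_minimizer(P, ell, k):
    """Split pattern P at its leftmost (ell, k)-minimizer position mu.

    Returns (mu, forward, backward), where forward = P[mu:] and backward is
    P[:mu + 1] reversed. Sharing the letter at mu is an assumed convention.
    """
    if not (1 <= k <= ell <= len(P)):
        raise ValueError("need 1 <= k <= ell <= len(P)")
    # With leftmost tie-breaking, the leftmost minimizer of P is the
    # minimizer of its first window P[0:ell], so one window suffices.
    mu = min(range(ell - k + 1), key=lambda i: (P[i:i + k], i))
    forward = P[mu:]
    backward = P[:mu + 1][::-1]
    return mu, forward, backward


print(split_at_leftmost_minimizer("CAGAGA", ell=4, k=2))
```

In the index described by the caption, the forward part would be searched in one suffix tree and the reversed part in the other, with the 2D grid combining the two partial results. That merging step is not shown here.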

Theorems & Definitions (27)

  • Example 1
  • Example 2
  • Definition 1
  • Lemma 1: 10.1093/bioinformatics/btaa472
  • Example 3
  • Theorem 2: DBLP:journals/iandc/BartonK0PR20
  • Example 4
  • Definition 2
  • Example 5
  • Lemma 3: DBLP:journals/mst/KociumakaPR19
  • ...and 17 more