Space-Efficient Indexes for Uncertain Strings

Esteban Gabory; Chang Liu; Grigorios Loukides; Solon P. Pissis; Wiktor Zuba

TL;DR

An index of $\mathcal{O}(\frac{nz}{\ell}\log z)$ expected size is proposed, which can be constructed using $\mathcal{O}(\frac{nz}{\ell}\log z)$ expected space, and supports very fast pattern matching queries in expectation, for patterns of length $m\geq \ell$.

Abstract

Strings in the real world are often encoded with some level of uncertainty. In the character-level uncertainty model, an uncertain string $X$ of length $n$ on an alphabet $Σ$ is a sequence of $n$ probability distributions over $Σ$. Given an uncertain string $X$ and a weight threshold $\frac{1}{z}\in(0,1]$, we say that pattern $P$ occurs in $X$ at position $i$, if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+|P|-1$ is at least $\frac{1}{z}$. While indexing standard strings for online pattern searches can be performed in linear time and space, indexing uncertain strings is much more challenging. Specifically, the state-of-the-art index for uncertain strings has $\mathcal{O}(nz)$ size, requires $\mathcal{O}(nz)$ time and $\mathcal{O}(nz)$ space to be constructed, and answers pattern matching queries in the optimal $\mathcal{O}(m+|\text{Occ}|)$ time, where $m$ is the length of $P$ and $|\text{Occ}|$ is the total number of occurrences of $P$ in $X$. For large $n$ and (moderate) $z$ values, this index is completely impractical to construct, which outweighs the benefit of the supported optimal pattern matching queries. We were thus motivated to design a space-efficient index at the expense of slower yet competitive pattern matching queries. We propose an index of $\mathcal{O}(\frac{nz}{\ell}\log z)$ expected size, which can be constructed using $\mathcal{O}(\frac{nz}{\ell}\log z)$ expected space, and supports very fast pattern matching queries in expectation, for patterns of length $m\geq \ell$. We have implemented and evaluated several versions of our index. The best-performing version of our index is up to two orders of magnitude smaller than the state of the art in terms of both index size and construction space, while offering faster or very competitive query and construction times.
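
The occurrence definition in the abstract can be made concrete with a brute-force scan. The sketch below is not the paper's index: it only checks the definition directly, position by position. The function name `z_valid_occurrences`, the dictionary-based representation of $X$, the 0-based positions, and the toy input are all assumptions made for illustration. Exact fractions are used so that the comparison against $\frac{1}{z}$ is not affected by floating-point rounding.

```python
from fractions import Fraction
from typing import Dict, List

# Assumed representation: one {letter: probability} dict per position of X.
UncertainString = List[Dict[str, Fraction]]


def z_valid_occurrences(X: UncertainString, P: str, z: int) -> List[int]:
    """Return all 0-based positions i where P occurs in X with probability >= 1/z."""
    threshold = Fraction(1, z)
    occurrences = []
    for i in range(len(X) - len(P) + 1):
        prob = Fraction(1)
        for j, letter in enumerate(P):
            # A letter missing from the distribution has probability 0.
            prob *= X[i + j].get(letter, Fraction(0))
            if prob < threshold:
                break  # the product can only shrink, so stop early
        else:
            occurrences.append(i)
    return occurrences


# Hypothetical toy input, not taken from the paper.
X = [
    {"A": Fraction(1, 2), "B": Fraction(1, 2)},
    {"A": Fraction(1)},
    {"A": Fraction(1, 4), "B": Fraction(3, 4)},
    {"B": Fraction(1)},
]

print(z_valid_occurrences(X, "AB", 4))  # [1, 2]
print(z_valid_occurrences(X, "AB", 2))  # [1]
```

This scan takes time proportional to $n \cdot m$ per query in the worst case, which is the cost an index is meant to avoid.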

Paper Structure

This paper contains 15 sections, 12 theorems, 1 equation, 12 figures, 2 tables, and 4 algorithms.

Introduction
Our Data Model and Motivation
Our Techniques and Results
Paper Organization
Preliminaries and Problem Definition
The New Index: Minimizer-based WST
Space-efficient Construction of the Index
Practically Fast Querying Without a Grid
Related Work
Experimental Evaluation
Data and Setup
Evaluating our Minimizer-based Indexes
Evaluating our Space-efficient Index Construction
Conclusion of our Experimental Evaluation
Discussion: Limitations and Future Work

Key Result

Lemma 1

The density of an $(\ell,k)$-minimizer scheme on alphabet $\Sigma$ with $k\ge \log_{|\Sigma|} \ell+c$ is $\mathcal{O}(\frac{1}{\ell})$, for some $c=\mathcal{O}(1)$.
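
To illustrate what Lemma 1 measures, here is a minimal sketch of an $(\ell,k)$-minimizer scheme on a standard string: in every window of length $\ell$, the start position of the smallest length-$k$ substring is selected, and the density is the fraction of positions selected. The lexicographic order with leftmost tie-breaking is an assumption made here for simplicity; the order used in the paper and in the cited lemma may differ (for example, a random or hash-based order). The function names and the random test string are likewise hypothetical.

```python
import random
from typing import List


def minimizers(S: str, ell: int, k: int) -> List[int]:
    """Return the sorted 0-based positions selected by an (ell, k)-minimizer scheme.

    Assumption: k-mers are compared lexicographically, ties go to the leftmost.
    """
    if not 1 <= k <= ell <= len(S):
        raise ValueError("need 1 <= k <= ell <= len(S)")
    selected = set()
    for start in range(len(S) - ell + 1):
        # Candidate k-mer start positions that fit inside the current window.
        candidates = range(start, start + ell - k + 1)
        best = min(candidates, key=lambda p: (S[p:p + k], p))
        selected.add(best)
    return sorted(selected)


def density(S: str, ell: int, k: int) -> float:
    """Fraction of positions of S that are selected as minimizers."""
    return len(minimizers(S, ell, k)) / len(S)


print(minimizers("CAGAGA", 4, 2))  # [1, 3]

# Empirical density on a random DNA-like string (value depends on the seed).
rng = random.Random(0)
T = "".join(rng.choice("ACGT") for _ in range(2000))
print(density(T, 32, 8))
```

Each window is guaranteed to contain at least one selected position, which is the property that lets a pattern of length $m\geq \ell$ be anchored at a sampled position during a query.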

Figures (12)

Figure 1: An informal overview of our techniques: Given a weighted string $X$ of length $n$ over an alphabet of size $\sigma$, a weight threshold $\frac{1}{z}$, and an integer $\ell$, we (A) use the algorithm from [DBLP:conf/cpm/BartonKPR16] to construct a family $\mathcal{S}$ of $z$ standard strings, each of length $n$. (B) For each such string, we consider all of its $n$ suffixes and sample them for the given integer $\ell$ using the minimizers mechanism [DBLP:journals/bioinformatics/RobertsHHMY04, DBLP:conf/sigmod/SchleimerWA03]. These suffixes imply a set of $\mathcal{O}(n/\ell)$ suffixes and a set of $\mathcal{O}(n/\ell)$ reversed suffixes in expectation. (C) We then index these suffixes in two suffix trees [DBLP:conf/focs/Weiner73], which we link using a 2D grid, so as to answer pattern matching queries for patterns of length at least $\ell$. (D) When such a pattern $P$ of length $m$ is given, we find its leftmost minimizer, which implies a suffix and a reversed suffix of $P$, and query those using our index. Our index efficiently merges the partial results (i.e., occurrences of suffixes and reversed suffixes); and (E) outputs all $z$-valid occurrences of $P$ in $X$. The size of the resulting index is $\mathcal{O}(\frac{nz}{\ell}\log z)$. The extra multiplicative $\log z$ factor comes from our representation of the edge labels in the suffix trees.
Figure 1: A $4$-estimation $\mathcal{S}$ of $X$ from Example \ref{ex:weighted_string}.
Figure 2: Suffix tree of $S=\texttt{CAGAGA\$}$.
Figure 3: Our minimizer-based index for the weighted string from Example \ref{ex:weighted_string}, $\frac{1}{z}=\frac{1}{4}$, and the minimizers from Example \ref{ex:2d-structure}. $\mathcal{T}_{\text{suff}}$ is the forward minimizer solid factor tree and $\mathcal{T}_{\text{pref}}$ is the backward one. Edges without labels are constructed for readability and mean that the parent and the children nodes correspond to the same string. Each leaf node representing the minimizer position $i$ in string $S_j$ is decorated with $(i,j)$.
Figure 4: Construct-$\mathcal{T}$
...and 7 more figures

Theorems & Definitions (27)

Example 1
Example 2
Definition 1
Lemma 1: 10.1093/bioinformatics/btaa472
Example 3
Theorem 2: DBLP:journals/iandc/BartonK0PR20
Example 4
Definition 2
Example 5
Lemma 3: DBLP:journals/mst/KociumakaPR19
...and 17 more
