U-index: A Universal Indexing Framework for Matching Long Patterns

Lorraine A. K. Ayad, Gabriele Fici, Ragnar Groot Koerkamp, Grigorios Loukides, Rob Patro, Giulio Ermanno Pibiri, Solon P. Pissis

TL;DR

This work introduces the U-index, a universal, sketch-based framework for long-pattern text indexing that places a lightweight sketch of the text atop the original data and builds an index on that sketch. By sketching both the text and queries with minimizers (or other locally consistent sketches) and then verifying candidate matches in the original text, the approach enables fast construction and substantially reduced index size while preserving competitive query performance for long patterns ($m\ge\ell$). The framework is agnostic to the underlying index on the sketch (e.g., suffix array, FM-index), and provides formal performance guarantees that relate construction time and space to the sketch size $z$ and a tunable parameter $\tau$ for encoding the sketch. Experimental results demonstrate substantial space and construction-time benefits over traditional indexes on DNA, protein, and English corpora, with practical effectiveness highlighted by long-read mapping experiments in computational biology. The work suggests that sketch-based universal indexes are promising where queries are long and datasets are large, offering a flexible, modular path to scalable text search.

Abstract

Text indexing is a fundamental and well-studied problem. Classic solutions either replace the original text with a compressed representation, e.g., the FM-index and its variants, or keep it uncompressed but attach some redundancy - an index - to accelerate matching. The former solutions thus retain excellent compressed space, but are slow in practice. The latter approaches, like the suffix array, instead sacrifice space for speed. We show that efficient text indexing can be achieved using just a small extra space on top of the original text, provided that the query patterns are sufficiently long. More specifically, we develop a new indexing paradigm in which a sketch of a query pattern is first matched against a sketch of the text. Once candidate matches are retrieved, they are verified using the original text. This paradigm is thus universal in the sense that it allows us to use any solution to index the sketched text, like a suffix array, FM-index, or r-index. We explore both the theory and the practice of this universal framework. With an extensive experimental analysis, we show that, surprisingly, universal indexes can be constructed much faster than their unsketched counterparts and take a fraction of the space, as a direct consequence of (i) having a lower bound on the length of patterns and (ii) working in sketch space. Furthermore, these data structures have the potential of retaining or even improving query time, because matching against the sketched text is faster and verifying candidates can be theoretically done in constant time per occurrence (or, in practice, by short and cache-friendly scans of the text). Finally, we discuss some important applications of this novel indexing paradigm to computational biology. We hypothesize that such indexes will be particularly effective when the queries are sufficiently long, and so demonstrate applications in long-read mapping.

Paper Structure

This paper contains 44 sections, 3 theorems, 3 equations, 6 figures.

Key Result

Theorem 1

When $T$ is a string of i.i.d. random characters and $k > (3+\varepsilon)\log_{\sigma}(w+1)$ for any $\varepsilon > 0$, the density of the random minimizer is $2/(w+1) + o(1/w)$.
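
The density bound can be checked empirically. Here density means the fraction of $k$-mer positions that the scheme selects. The sketch below is a minimal simulation, not the authors' code: the names `random_minimizer_positions` and `density` are hypothetical, and the "random" order on $k$-mers is assumed to be a pseudo-random rank drawn once per distinct $k$-mer.

```python
import random

def random_minimizer_positions(text, w, k, seed=0):
    """In every window of w consecutive k-mers, select the k-mer with the
    smallest pseudo-random rank (leftmost on ties); return selected positions."""
    rng = random.Random(seed)
    rank = {}  # one random rank per distinct k-mer (assumed random order)
    ranks = [rank.setdefault(text[i:i + k], rng.random())
             for i in range(len(text) - k + 1)]
    picked = set()
    for start in range(len(ranks) - w + 1):
        # min() returns the first index attaining the minimum, i.e. the leftmost
        picked.add(min(range(start, start + w), key=ranks.__getitem__))
    return sorted(picked)

def density(text, w, k, seed=0):
    """Fraction of k-mer positions selected as minimizers."""
    return len(random_minimizer_positions(text, w, k, seed)) / (len(text) - k + 1)

if __name__ == "__main__":
    rng = random.Random(42)
    text = "".join(rng.choice("ACGT") for _ in range(50000))  # i.i.d. characters
    w, k = 10, 12  # k exceeds 3 * log_4(w + 1), roughly 5.2
    print(density(text, w, k), 2 / (w + 1))
```

On a long i.i.d. text with $k$ above the threshold, the measured value should sit close to $2/(w+1)$; small deviations come from boundary windows and sampling noise.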

Figures (6)

  • Figure 1: The U-index framework. Steps (1) and (2) build the index. Steps (3)-(6) answer a query with the framework. The sketching scheme in steps (1) and (3) must be the same.
  • Figure 2: An illustration of the U-index of a text $T$, along with a query example. First, the minimizers $M$ of $T$ are found, here of length $k=4$ characters, with two of them overlapping (those starting at positions 7 and 9). The minimizer positions $\mathcal{M}_{\ell,k}$ are stored with Elias-Fano coding. Minimizers are hashed via $H$ to shorter IDs. These are padded to the next multiple of $\tau$. An index is then built on the sketch $S$. To locate a pattern $P$, its minimizers are found and the sketch $Q$ of corresponding IDs is constructed. Then $Q$ is located in $S$, which here gives a single match. The first minimizer of the match in $Q$ is located in $T$ at position $l$ via $\mathcal{M}_{\ell,k}$. Lastly, the candidate match is verified starting at position $l-\alpha$ in $T$.
  • Figure 3: The $\mathcal{O}(1)$-time verification algorithm for occurrence $l-\alpha$. After spelling the fragments of $P$ in the two tries once, we check if the fragments in gray match using fingerprints in $\mathcal{O}(1)$ time; if so, we check if the corresponding leaf nodes are both located in the induced subtrees in $\mathcal{O}(1)$ time.
  • Figure 4: Results on $59$ MiB of human chromosome 1. For each data structure (except the sparse SA) we compare its performance when constructed on the plain input text (in red, left column of each group) versus when used with the U-index (remaining colors and columns), for increasing values of $k$ and $\ell$. Indexes marked with -H (read, "minus H") use minimizers themselves as IDs, without the map $H$. Similarly, the indexes marked with -S omit storing the sketched input text $S$ and instead reconstruct it via the minimizer positions $\mathcal{M}_{\ell,k}(T)$ and $T$ itself. The sparse SA is only shown with sampling (no red column) because it is otherwise equivalent to SA. The top plot shows the space usage (size) of the final data structure in MiB, with the space for minimizer positions and the map $H$ shaded, and the black line indicating the space occupied by the 2-bit packed input text. The second row shows the maximum memory usage (resident set size) during the construction, where the shaded area is the memory usage before construction. The third row shows the construction time (in seconds), with the time for sketching the input shaded. The bottom plot shows the query time (in average $\mu$s per Locate query), with the time for searching in the inner index shaded.
  • Figure 5: Results on $200$ MiB of protein sequences. Refer to the caption of Figure 4.
  • ...and 1 more figure
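
The build and query steps described in the Figure 2 caption can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: `ToyUIndex` and `minimizers` are hypothetical names, the $k$-mer order is taken from `zlib.crc32`, the map $H$ is a plain dictionary, the inner index is a naively built suffix array over the sketch, positions are kept in a Python list rather than Elias-Fano coding, there is no $\tau$-padding, and verification is a direct substring comparison rather than the $\mathcal{O}(1)$-time algorithm of Figure 3.

```python
import zlib

def minimizers(s, ell, k):
    """Start positions of the (ell, k)-minimizers of s: in each length-ell
    window, the k-mer with the smallest key (leftmost on ties)."""
    w = ell - k + 1  # number of k-mers per window
    keys = [zlib.crc32(s[i:i + k].encode()) for i in range(len(s) - k + 1)]
    picked = set()
    for start in range(len(keys) - w + 1):
        picked.add(min(range(start, start + w), key=keys.__getitem__))
    return sorted(picked)

class ToyUIndex:
    def __init__(self, text, ell, k):
        self.text, self.ell, self.k = text, ell, k
        self.pos = minimizers(text, ell, k)   # minimizer positions in T
        self.ids = {}                         # stand-in for the map H
        self.sketch = [self.ids.setdefault(text[p:p + k], len(self.ids))
                       for p in self.pos]     # sketch S of T
        # Inner index: suffix array over S (any index on S could be used).
        self.sa = sorted(range(len(self.sketch)), key=lambda i: self.sketch[i:])

    def locate(self, pattern):
        m = len(pattern)
        if m < self.ell:
            raise ValueError("pattern must have length at least ell")
        ppos = minimizers(pattern, self.ell, self.k)
        try:
            q = [self.ids[pattern[p:p + self.k]] for p in ppos]  # sketch Q
        except KeyError:
            return []  # some minimizer of P never occurs as a minimizer of T
        alpha = ppos[0]  # offset of the first minimizer inside P
        # Lower-bound binary search for Q among the sorted suffixes of S.
        lo, hi = 0, len(self.sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.sketch[self.sa[mid]:self.sa[mid] + len(q)] < q:
                lo = mid + 1
            else:
                hi = mid
        hits = []
        while lo < len(self.sa) and self.sketch[self.sa[lo]:self.sa[lo] + len(q)] == q:
            start = self.pos[self.sa[lo]] - alpha  # candidate l - alpha
            if start >= 0 and self.text[start:start + m] == pattern:  # verify in T
                hits.append(start)
            lo += 1
        return sorted(hits)
```

For example, `ToyUIndex(T, ell=16, k=5).locate(P)` returns the sorted start positions of `P` in `T` for any `P` of length at least 16. The sketch of an occurrence is a contiguous run in `S` because minimizers are locally consistent, which is why a plain substring search over `S` followed by verification misses no occurrence.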

Theorems & Definitions (3)

  • Theorem 1: Theorem 3 from zheng2020improved
  • Theorem 2: DBLP:journals/corr/abs-2407-11819
  • Theorem 3