
Linear Index for Logarithmic Search-Time for any String under any Internal Node in Suffix Trees

Anas Al-okaily

TL;DR

The paper tackles the challenge of efficiently searching for a pattern under internal nodes in suffix trees, a problem central to approximate pattern matching. It introduces OT_index, a linear-size index built from three sub-indexes (base paths, Hanadi nodes, Srivastava nodes) using OSHR-tree concepts, enabling $O(\log_2 n)$ search time via binary search on per-node lists. Key contributions include the formal definitions of Hanadi and Srivastava nodes, a concrete construction algorithm, and empirical evidence showing speedups over walking for depths $2$–$10$ with a total index size of $O(\Sigma n)$ (and $O(n)$ for Srivastava). This work offers a scalable, practical approach to enhance internal-node pattern searches in suffix trees, with direct implications for faster approximate pattern matching on large genomic data sets.

Abstract

Suffix trees are a key and efficient data structure for solving string problems. A suffix tree is a compressed trie containing all the suffixes of a given text of length $n$, and it can be constructed in linear time. In this work, we introduce an algorithm to build a linear-size index that allows finding a pattern of any length under any internal node in a suffix tree in $O(\log n)$ time.
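
The lookup primitive named in the TL;DR, binary search on a sorted per-node list, can be illustrated with a minimal sketch. This is not the paper's OT_index construction: the `node_lists` mapping, the node identifiers, and the integer keys are hypothetical placeholders, and the sketch only shows why a membership query on a sorted list of at most $n$ entries costs $O(\log n)$ comparisons.

```python
from bisect import bisect_left


def contains(sorted_keys, key):
    """Return True if key occurs in sorted_keys, using binary search."""
    i = bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key


# Hypothetical per-node lists: node identifier -> sorted integer keys.
# In a real index the keys would be derived from the suffix tree; here
# they are made-up values for illustration only.
node_lists = {
    "node_1": [2, 5, 9, 14],
    "node_2": [3, 7],
}


def found_under(node_id, key):
    """Check whether key is recorded in the list attached to node_id."""
    return contains(node_lists.get(node_id, []), key)
```

Each query touches only the list attached to one node, so the cost depends on that list's length rather than on the size of the subtree that walking would have to traverse.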

Paper Structure (5 sections, 1 figure, 2 tables)

This paper contains 5 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Node 1 in the left picture is an example of a Hanadi node; node A is a reference leaf node for node 1 (and hence is a reference node for a Hanadi node); node 2 is not a Hanadi node because it is an OSHR internal node and must already be indexed by the index of base paths. Node 1 in the right picture is an example of a Srivastava node, as it is an OSHR leaf node, has a reference leaf node (node A), and has a reference internal node (node B); node A is hence also a reference node for a Srivastava node.