Nearly Optimal Internal Dictionary Matching

Jingbang Chen; Jiangqi Dai; Qiuyang Mang; Qingyu Shi; Tingqiang Xu

TL;DR

The paper addresses internal dictionary matching (IDM) by introducing the Basic Substring Structure (BASS), a linear-space framework for static dictionaries that supports near-optimal IDM queries on a text of length $n$ with a dictionary of size $d$. BASS organizes substrings via a grid and equivalence classes, and uses PreTree and SufTree constructions to map blocks to efficient query structures, including a near-linear 2D range-counting backbone. The resulting bounds are: CountDistinct in $O(\log n)$ time with $O(n\log^2 n + d)$ space; Count in $O(\log n/\log\log n)$ time with $O(n+d\sqrt{\log n})$ preprocessing; and ReportDistinct in $O(1+|\text{output}|)$ time with $O(n+d)$ preprocessing. Exists and Report attain the optimal $O(1+|\text{output}|)$ query time. The authors also extend the approach to Range Longest Common Substring with constant-time queries and discuss potential applications to other internal query problems, presenting BASS as a unifying tool for IDM and related string-processing tasks.

Abstract

We study the internal dictionary matching (IDM) problem where a dictionary $\mathcal{D}$ containing $d$ substrings of a text $T$ is given, and each query concerns the occurrences of patterns in $\mathcal{D}$ in another substring of $T$. We propose a novel $O(n)$-sized data structure named Basic Substring Structure (BASS), where $n$ is the length of the text $T$. With BASS, we are able to handle all types of queries in the IDM problem in nearly optimal query and preprocessing time. Specifically, our results include:

• The first algorithm that answers the CountDistinct query in $\tilde{O}(1)$ time with $\tilde{O}(n+d)$ preprocessing, where we need to compute the number of distinct patterns that exist in $T[l,r]$. Previously, the best result was $\tilde{O}(m)$ time per query after $\tilde{O}(n^2/m+d)$ or $\tilde{O}(nd/m+d)$ preprocessing, where $m$ is a chosen parameter.

• Faster algorithms for two other types of internal queries. We improve the runtime for (1) occurrence counting (Count) queries to $O(\log n/\log\log n)$ time per query with $O(n+d\sqrt{\log n})$ preprocessing, from $O(\log^2 n/\log\log n)$ time per query with $O(n\log n/\log \log n+d\log^{3/2} n)$ preprocessing; and (2) distinct pattern reporting (ReportDistinct) queries to $O(1+|\text{output}|)$ time per query, from $O(\log n+|\text{output}|)$ per query.

In addition, we match the optimal runtime in the remaining two types of queries, pattern existence (Exists) and occurrence reporting (Report). We also show that BASS is more generally applicable to other internal query problems.
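To make the five query types concrete, here is a minimal brute-force reference sketch in Python. It is not BASS and has none of the paper's time bounds; it only pins down what each query is asked to return. The function names, the 0-based inclusive indexing of $T[l,r]$, and the representation of the dictionary as a list of distinct non-empty strings are assumptions made for illustration.

```python
from typing import List, Set, Tuple


def occurrences(text: str, pattern: str, l: int, r: int) -> List[int]:
    """Start positions of `pattern` lying entirely inside text[l..r] (inclusive)."""
    m = len(pattern)
    # A valid start i must satisfy l <= i and i + m - 1 <= r.
    return [i for i in range(l, r - m + 2) if text.startswith(pattern, i)]


def report(text: str, dictionary: List[str], l: int, r: int) -> List[Tuple[str, int]]:
    """Report: every (pattern, start) occurrence inside text[l..r]."""
    return [(p, i) for p in dictionary for i in occurrences(text, p, l, r)]


def exists(text: str, dictionary: List[str], l: int, r: int) -> bool:
    """Exists: does any dictionary pattern occur inside text[l..r]?"""
    return any(occurrences(text, p, l, r) for p in dictionary)


def count(text: str, dictionary: List[str], l: int, r: int) -> int:
    """Count: total number of occurrences of all patterns inside text[l..r]."""
    return len(report(text, dictionary, l, r))


def report_distinct(text: str, dictionary: List[str], l: int, r: int) -> Set[str]:
    """ReportDistinct: the patterns that occur at least once inside text[l..r]."""
    return {p for p in dictionary if occurrences(text, p, l, r)}


def count_distinct(text: str, dictionary: List[str], l: int, r: int) -> int:
    """CountDistinct: how many distinct patterns occur inside text[l..r]."""
    return len(report_distinct(text, dictionary, l, r))
```

On the text `abbab` with dictionary `["ab", "ba", "bb"]` and the full range `l = 0, r = 4`, `count` returns 4 (the pattern `ab` occurs twice) while `count_distinct` returns 3, which is the distinction between the Count and CountDistinct queries.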

Paper Structure (31 sections, 36 theorems, 9 equations, 3 figures, 2 tables)

Introduction
Our Results
Counting Distinct Patterns
Other Internal Queries
Organization of This Paper
Preliminaries
Trie
Suffix Tree
Basic Substring Structure
Grid
Equivalence Class
Relationships between Blocks
Optimal Algorithm for Occurrence Counting
On Querying Distinct Patterns
Problem Decomposition
...and 16 more sections

Key Result

Theorem 1

CountDistinct$(i, j)$ can be answered in $O(\log n)$ time with a data structure of $O(n\log^2 n + d)$ size that can be constructed in $O(n\log^2 n + d\sqrt{\log n})$ time.
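As a rough worked comparison, the earlier trade-off quoted in the abstract ($\tilde{O}(m)$ query time after $\tilde{O}(n^2/m+d)$ preprocessing) can be instantiated at two sample values of the parameter $m$ and set beside Theorem 1. The choices $m=\sqrt{n}$ and $m=n$ are illustrative assumptions, not values singled out by the paper, and only the first of the two earlier preprocessing bounds is used.

```latex
% Prior trade-off (from the abstract), with parameter m:
%   query \tilde{O}(m), preprocessing \tilde{O}(n^2/m + d).
\begin{align*}
m = \sqrt{n}: &\quad \text{query } \tilde{O}(\sqrt{n}), \quad \text{preprocessing } \tilde{O}(n^{3/2} + d),\\
m = n: &\quad \text{query } \tilde{O}(n), \quad \text{preprocessing } \tilde{O}(n + d),\\
\text{Theorem 1}: &\quad \text{query } O(\log n), \quad \text{preprocessing } O(n\log^2 n + d\sqrt{\log n}).
\end{align*}
```

Under the earlier trade-off, near-linear preprocessing forces near-linear query time, whereas Theorem 1 gives both polylogarithmic query time and near-linear preprocessing.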

Figures (3)

Figure 1: Two illustrative examples: (a) $\mathcal{E}(\{\texttt{ab}, \texttt{ba}, \texttt{bb}\})$; (b) $\mathcal{T}(\texttt{abbab})$.
Figure 2: (a) The grid of the string $\texttt{abbab}$. (b) The blocks corresponding to each equivalence class of the string; each color marks the blocks of one equivalence class.
Figure 3: Blocks and $\texttt{PreTree}(T)$

Theorems & Definitions (45)

Theorem 1: CountDistinct
Theorem 2: Count
Theorem 3: ReportDistinct
Definition 4: Substring
Remark 5
Definition 6: Prefix and Suffix
Definition 7: Occurrence Set
Theorem 8: farach2000sorting
Definition 9: Extension
Lemma 10
...and 35 more
