Constant sensitivity on the CDAWGs

Rikuya Hamai, Hiroto Fujimaru, Shunsuke Inenaga

TL;DR

The paper investigates how the size of a Compact Directed Acyclic Word Graph (CDAWG) changes when a single character is edited at an arbitrary position in the input string $T$. The authors develop a purely combinatorial analysis of maximal repeats and their right-extensions around the edited position, partitioning new and existing repeats into sets and bounding each set's contribution to the CDAWG edge count. They prove a constant-factor upper bound: the post-edit CDAWG size $\mathsf{e}(T')$ satisfies $\mathsf{e}(T') \le 8\mathsf{e}(T)+4$, i.e., a multiplicative factor of at most $8$ in the limit as the original size $\mathsf{e}(T)$ grows. This establishes that CDAWGs have $O(1)$ multiplicative sensitivity to single-character edits, making them robust against edits and errors. The bound is not known to be tight: the known lower bound is $2$, leaving a gap for future refinement.
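
To spell out the phrase "in the limit", dividing the stated bound by $\mathsf{e}(T)$ gives the ratio bound that appears in Theorem 1. This is only a rearrangement of the inequality above, assuming $\mathsf{e}(T) \ge 1$:

$$
\frac{\mathsf{e}(T')}{\mathsf{e}(T)} \;\le\; \frac{8\mathsf{e}(T)+4}{\mathsf{e}(T)} \;=\; 8 + \frac{4}{\mathsf{e}(T)} \;\xrightarrow[\mathsf{e}(T)\to\infty]{}\; 8 .
$$

The additive term $4/\mathsf{e}(T)$ vanishes as the original CDAWG grows, so the asymptotic factor is $8$, compared with the known lower bound of $2$.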

Abstract

Compact directed acyclic word graphs (CDAWGs) [Blumer et al. 1987] are a fundamental data structure on strings with applications in text pattern searching, data compression, and pattern discovery. Intuitively, the CDAWG of a string $T$ is obtained by merging isomorphic subtrees of the suffix tree [Weiner 1973] of the same string $T$, and thus CDAWGs are a compact indexing structure. In this paper, we investigate the sensitivity of CDAWGs when a single character edit operation is performed at an arbitrary position in $T$. We show that the size of the CDAWG after an edit operation on $T$ is asymptotically at most 8 times larger than the original CDAWG before the edit.
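
The quantity being bounded can be explored on small strings with a brute-force sketch. The code below is not the authors' construction: it assumes that the non-sink nodes of the CDAWG correspond to substrings that are both left-maximal and right-maximal (with prefixes counted as left-maximal and suffixes as right-maximal), and that the edge count is the total number of distinct right-extension characters over those substrings. The paper's exact conventions may differ, and all function names and the `alphabet` parameter are hypothetical.

```python
def occurrences(t, x):
    """Start positions of every (possibly overlapping) occurrence of x in t."""
    return [i for i in range(len(t) - len(x) + 1) if t.startswith(x, i)]


def right_extensions(t, x):
    """Distinct characters c such that x followed by c occurs in t."""
    return {t[i + len(x)] for i in occurrences(t, x) if i + len(x) < len(t)}


def left_extensions(t, x):
    """Distinct characters c such that c followed by x occurs in t."""
    return {t[i - 1] for i in occurrences(t, x) if i > 0}


def maximal_substrings(t):
    """Substrings that are both left- and right-maximal (assumed node set)."""
    n = len(t)
    candidates = {t[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    return {
        x for x in candidates
        if (t.startswith(x) or len(left_extensions(t, x)) >= 2)
        and (t.endswith(x) or len(right_extensions(t, x)) >= 2)
    }


def cdawg_edges(t):
    """Assumed edge count: sum of right-extension counts over the node set."""
    return sum(len(right_extensions(t, x)) for x in maximal_substrings(t))


def single_edits(t, alphabet):
    """All strings obtained from t by one insertion, substitution, or deletion."""
    out = set()
    for i in range(len(t) + 1):
        for c in alphabet:
            out.add(t[:i] + c + t[i:])          # insertion before position i
            if i < len(t) and c != t[i]:
                out.add(t[:i] + c + t[i + 1:])  # substitution at position i
        if i < len(t):
            out.add(t[:i] + t[i + 1:])          # deletion at position i
    return out


t = "ababcababd"  # the string (ab)^2 c (ab)^2 d from Figure 1
e = cdawg_edges(t)
worst = max(cdawg_edges(u) for u in single_edits(t, "abcd"))
print(sorted(maximal_substrings(t), key=len))
print(e, worst, worst / e)
```

On the Figure 1 string, `maximal_substrings` returns the empty string, `ab`, `abab`, and the whole string, matching the set $\mathsf{M}(T)$ listed in that caption. The sketch is cubic or worse in the string length, so it is only suitable for experimenting with short inputs.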

Paper Structure

This paper contains 19 sections, 16 theorems, 6 equations, 17 figures, 1 table.

Key Result

Theorem 1

For any string $T$ of length $n$, $\mathsf{MS}_{\mathrm{Ins}}(\mathsf{e}, n) \leq (8\mathsf{e}+4)/\mathsf{e}$, $\mathsf{MS}_{\mathrm{Del}}(\mathsf{e}, n) \leq (8\mathsf{e}+4)/\mathsf{e}$, $\mathsf{MS}_{\mathrm{Sub}}(\mathsf{e}, n) \leq (8\mathsf{e}+4)/\mathsf{e}$ hold, whe

Figures (17)

  • Figure 1: Illustration for $\mathsf{CDAWG}(T)$ of string $T=\mathrm{(ab)^2 c(ab)^2d}$. The longest strings represented by the nodes of $\mathsf{CDAWG}(T)$ are the maximal substrings in $\mathsf{M}(T) = \{\mathrm{\varepsilon, ab, (ab)^2, (ab)^2 c(ab)^2d}\}$.
  • Figure 2: Illustration of $x_L$ in $T'$ for the case where $x_L$ contains $i$, with insertion and substitution.
  • Figure 3: Illustration for the five cases of a string $x$ when $x_L \notin \mathsf{Prefix}(T')$ and $x_R \notin \mathsf{Suffix}(T')$, where $a \ne c$ and $b \ne d$ for characters $a,b,c,d \in \Sigma$.
  • Figure 4: Illustration for $T'= \mathrm{cabcabcdabca|bcabcdabcabdcabcabcabdabcab}$ in Example \ref{ex:mr} and the occurrences of $\mathrm{abcabc}\in\mathsf{N}_{\mathrm{1}}$ and $\mathrm{cabcabcdabcab}\in\mathsf{N}_{\mathrm{3B}}$ in $T'$. The $|$ symbol in $T'$ marks the edit position. The solid-line boxes show the crossing occurrences of $\mathrm{abcabc}$ and $\mathrm{cabcabcdabcab}$ in $T'$, and the dashed-line boxes show their non-crossing occurrences in $T'$.
  • Figure 5: Illustration for Lemma \ref{lem:exist1}, where $i$ is the edited position and $a, b, c$ are pairwise distinct.
  • ...and 12 more figures

Theorems & Definitions (40)

  • Theorem 1
  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Example 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • ...and 30 more