Tight bounds for the sensitivity of CDAWGs with left-end edits

Hiroto Fujimaru, Yuto Nakashima, Shunsuke Inenaga

TL;DR

This work investigates how the size of a CDAWG changes when a single character edit is applied at the left end of the input string. It develops tight additive bounds on the increase in the number of edges, $e' - e$, for left-end insertions, deletions, and substitutions, with the strongest results for insertions: $AS_{\text{LeftIns}}(\mathsf{e},n) \le \mathsf{e}$ and a matching lower bound $AS_{\text{LeftIns}}(\mathsf{e},n) \ge \mathsf{e}-1$, plus near-tight bounds for the other edits. The paper also extends these insights to leftward online construction, proving a quadratic-time lower bound $\Omega(n^2)$ for updating the CDAWG as the string is prepended leftward, both in the plain online and batched settings. Overall, the results establish robust additive sensitivity bounds for CDAWGs under left-end edits and reveal fundamental limits on leftward maintenance algorithms. These findings have implications for CDAWG-based indexing and compression in the presence of edits and errors, and motivate extending the sensitivity framework to arbitrary-position edits and CDAWG-grammar sizes.

Abstract

Compact directed acyclic word graphs (CDAWGs) [Blumer et al. 1987] are a fundamental data structure on strings with applications in text pattern searching, data compression, and pattern discovery. Intuitively, the CDAWG of a string $T$ is obtained by merging isomorphic subtrees of the suffix tree [Weiner 1973] of the same string $T$, thus CDAWGs are a compact indexing structure. In this paper, we investigate the sensitivity of CDAWGs when a single character edit operation (insertion, deletion, or substitution) is performed at the left-end of the input string $T$, namely, we are interested in the worst-case increase in the size of the CDAWG after a left-end edit operation. We prove that if $e$ is the number of edges of the CDAWG for string $T$, then the number of new edges added to the CDAWG after a left-end edit operation on $T$ does not exceed $e$. Further, we present a matching lower bound on the sensitivity of CDAWGs for left-end insertions, and almost matching lower bounds for left-end deletions and substitutions. We then generalize our lower-bound instance for left-end insertions to leftward online construction of the CDAWG, and show that it requires $\Omega(n^2)$ time for some string of length $n$.

Paper Structure (21 sections, 21 theorems, 23 equations, 6 figures, 1 table)

Key Result

Lemma 1

If $ax \notin \mathsf{M}(T)$ and $ax \in \mathsf{M}(aT)$ (i.e. $ax$ is a new node in $\mathsf{CDAWG}(aT)$), then $x \in \mathsf{M}(T)$. Also, $\mathsf{d}_{aT}(ax) \leq \mathsf{d}_{T}(x)$.
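
Lemma 1 can be sanity-checked by brute force on short strings. The sketch below is not the paper's proof or algorithm. It assumes the standard definition of a maximal substring (one that is a prefix or has two distinct left-extensions, and is a suffix or has two distinct right-extensions), and it assumes that $\mathsf{d}_{T}(x)$ denotes the number of distinct right-extensions of $x$ in $T$, that is, the out-degree of node $x$. Neither definition is spelled out in this summary, and the function names `maximal_out_degrees` and `lemma1_holds` are hypothetical.

```python
from itertools import product


def maximal_out_degrees(t: str) -> dict:
    """Map each maximal substring of t to its number of distinct right-extensions.

    Assumed definition: x is maximal if it is a prefix of t or has two distinct
    preceding characters, and is a suffix of t or has two distinct following ones.
    """
    n = len(t)
    left, right = {}, {}
    for i in range(n + 1):
        for j in range(i, n + 1):
            x = t[i:j]
            # None marks "occurs as a prefix" (left) or "occurs as a suffix" (right).
            left.setdefault(x, set()).add(t[i - 1] if i > 0 else None)
            right.setdefault(x, set()).add(t[j] if j < n else None)

    def is_maximal_side(contexts: set) -> bool:
        return None in contexts or len(contexts) >= 2

    return {
        x: len(right[x] - {None})
        for x in left
        if is_maximal_side(left[x]) and is_maximal_side(right[x])
    }


def lemma1_holds(t: str, a: str) -> bool:
    """Check the two claims of Lemma 1 for string t and prepended character a."""
    old = maximal_out_degrees(t)
    new = maximal_out_degrees(a + t)
    for y, degree in new.items():
        if y.startswith(a) and y not in old:  # y = ax is a new node
            x = y[1:]
            if x not in old or degree > old[x]:
                return False
    return True


if __name__ == "__main__":
    ok = all(
        lemma1_holds("".join(p), a)
        for n in range(7)
        for p in product("ab", repeat=n)
        for a in "abc"
    )
    print("Lemma 1 holds on all tested strings:", ok)
```

The check is exhaustive only over binary strings of length at most 6, so it illustrates the statement rather than establishing it.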

Figures (6)

  • Figure 1: Illustration for $\mathsf{CDAWG}(T)$ of string $T=(ab)^4 c(ab)^3$. Every substring of $T$ can be spelled out from a distinct path from the source $\varepsilon$. There is a one-to-one correspondence between the maximal substrings in $\mathsf{M}(T) = \{\varepsilon, ab, (ab)^2, (ab)^3, (ab)^4 c(ab)^3\}$ and the nodes of $\mathsf{CDAWG}(T)$. The number of right-extensions of $\mathsf{CDAWG}(T)$ is the number $\mathsf{e}(T)$ of edges, which is 9 in this example.
  • Figure 2: Illustration for the CDAWGs of strings $T=(ab)^3 abc(ab)^3$ and $T'= bT=b(ab)^3 abc (ab)^3$ with $m = 3$. The omitted edge labels are all $c(ab)^3$. Observe that $\mathsf{e}(T) = 9$ and $\mathsf{e}(T') = 17$, and hence $\mathsf{e}(T')-\mathsf{e}(T) = 8 = \mathsf{e}(T)-1$ with this left-end insertion.
  • Figure 3: Illustration for the CDAWGs of strings $T=(ab)^4 c(ab)^3$ and $T'= T[2..n]=b(ab)^3 c (ab)^3$ with $m = 3$. The omitted edge labels are all $c(ab)^3$. Observe that $\mathsf{e}(T) = 9$, $\mathsf{e}(T') = 14$, and hence $\mathsf{e}(T')-\mathsf{e}(T) = 5 = \mathsf{e}(T)-4$ with this left-end deletion.
  • Figure 4: Illustration for the CDAWGs of strings $T=(ab)^4 c(ab)^3$ and $T'= bT[2..n]=bb(ab)^3 c (ab)^3$ with $m = 3$. The omitted edge labels are all $c(ab)^3$. Observe that $\mathsf{e}(T) = 9$, $\mathsf{e}(T') = 15$, and hence $\mathsf{e}(T')-\mathsf{e}(T) = 6 = \mathsf{e}(T)-3$ with this left-end substitution.
  • Figure 5: Illustration for the CDAWGs of strings $T_{k,m}=(ab)^3 cab(ab)^4\$$, $bT_{k,m}=b(ab)^3 cab(ab)^4\$$, and $T_{k+1,m}=(ab)^4 cab (ab)^4\$$ with $k = 1, m = 2$.
  • ...and 1 more figures
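
The edge counts quoted in the captions of Figures 1 to 4 can be reproduced with a small brute-force counter. This is only a sketch under assumptions, not the construction used in the paper: it takes the nodes of $\mathsf{CDAWG}(T)$ to be the maximal substrings of $T$ (standard definition, assumed here) and counts one edge per distinct right-extension of each node, as described in the Figure 1 caption. The function name `cdawg_edge_count` is hypothetical, and the quadratic enumeration of substrings is meant for tiny inputs only.

```python
def cdawg_edge_count(t: str) -> int:
    """Count CDAWG edges of t by brute force (suitable for short strings only).

    Assumption: nodes are substrings that are left-maximal (prefix of t, or two
    distinct preceding characters) and right-maximal (suffix of t, or two
    distinct following characters). Edges are the distinct right-extensions.
    """
    n = len(t)
    left, right = {}, {}
    for i in range(n + 1):
        for j in range(i, n + 1):
            x = t[i:j]
            # None records a prefix occurrence (left) or a suffix occurrence (right).
            left.setdefault(x, set()).add(t[i - 1] if i > 0 else None)
            right.setdefault(x, set()).add(t[j] if j < n else None)

    edges = 0
    for x in left:
        left_maximal = None in left[x] or len(left[x]) >= 2
        right_maximal = None in right[x] or len(right[x]) >= 2
        if left_maximal and right_maximal:
            edges += len(right[x] - {None})
    return edges


if __name__ == "__main__":
    T = "ab" * 4 + "c" + "ab" * 3  # T = (ab)^4 c (ab)^3, as in Figure 1
    for name, s in [
        ("T", T),                          # Figure 1
        ("insertion bT", "b" + T),         # Figure 2
        ("deletion T[2..n]", T[1:]),       # Figure 3
        ("substitution bT[2..n]", "b" + T[1:]),  # Figure 4
    ]:
        print(f"{name}: e = {cdawg_edge_count(s)}")
```

For this $T$ the counter gives 9, 17, 14, and 15 edges respectively, matching the values reported in the captions.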

Theorems & Definitions (37)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • Lemma 5
  • proof
  • Lemma 6
  • ...and 27 more