On the sensitivity of CDAWG-grammars

Hiroto Fujimaru, Shunsuke Inenaga

TL;DR

The paper addresses how sensitive the CDAWG-based grammar $\mathsf{G_{CDAWG}}(T)$ is to a single-character edit. The grammar size is expressed as $\mathsf{g}(T)=\mathsf{e}(T)-\mathrm{v}^{(1)}(T)$, where $\mathsf{e}(T)$ is the number of edges of $\mathsf{CDAWG}(T)$ and $\mathrm{v}^{(1)}(T)$ is the number of its nodes with in-degree one. The authors analyze how both terms vary under edits: they bound the edge change $\mathsf{e}(T')-\mathsf{e}(T)$ via a partition of maximal repeats and crossing occurrences, and bound the in-degree-one node change $\mathrm{v}^{(1)}(T')-\mathrm{v}^{(1)}(T)$ through a detailed analysis of how nodes in $\mathsf{CDAWG}(T)$ can gain or lose in-edges. Combining these bounds yields the main result: the additive sensitivity satisfies $\mathsf{AS}(\mathsf{g},n) \le 4\mathsf{e}(T) + 4$, i.e., the CDAWG-grammar size increases by at most $4\mathsf{e}(T)+4$ after a single-character edit, where $T$ is the string before the edit. This gives a concrete worst-case bound in terms of the original CDAWG size, with implications for dynamic text compression and substring querying. The work leverages properties of maximal repeats, right-extensions, and the structure of the reversed CDAWG to derive these combinatorial bounds.
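
Spelled out, the decomposition behind this bound can be sketched as follows. The definition of $\mathsf{AS}$ given here is the usual additive-sensitivity definition and is assumed rather than quoted from the paper; $T'$ denotes any string obtained from $T$ by one single-character edit, and $\mathsf{ed}$ is assumed notation for edit distance.

$$
\mathsf{AS}(\mathsf{g}, n) = \max_{T \in \Sigma^n} \{\, \mathsf{g}(T') - \mathsf{g}(T) : \mathsf{ed}(T, T') = 1 \,\}
$$

$$
\mathsf{g}(T') - \mathsf{g}(T) = \bigl(\mathsf{e}(T') - \mathsf{e}(T)\bigr) - \bigl(\mathrm{v}^{(1)}(T') - \mathrm{v}^{(1)}(T)\bigr) \le 4\mathsf{e}(T) + 4
$$

Because $\mathrm{v}^{(1)}$ enters with a negative sign, an upper bound on the grammar-size change needs an upper bound on the edge change together with a lower bound on the in-degree-one node change.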

Abstract

The compact directed acyclic word graph (CDAWG) [Blumer et al. 1987] of a string is the minimal compact automaton that recognizes all the suffixes of the string. CDAWGs are known to be useful for various string tasks including text pattern searching, data compression, and pattern discovery. The CDAWG-grammar [Belazzougui & Cunial 2017] is a grammar-based text compression scheme built on the CDAWG. In this paper, we prove that the CDAWG-grammar size $g$ can increase by at most an additive term of $4e + 4$ after any single-character edit operation is performed on the input string, where $e$ denotes the number of edges in the corresponding CDAWG before the edit.

Paper Structure

This paper contains 13 sections, 18 theorems, 6 equations, 9 figures.

Key Result

Lemma 1

If $x \in \mathsf{N}_1 \cup \mathsf{N}_{3\mathrm{A}}$, there is no pair $(y, z) \subseteq \mathsf{N}_1 \cup \mathsf{N}_{3\mathrm{A}}$ of distinct strings ($y \neq z$) with $|x| < |y|$ and $|x| < |z|$ such that both $S_{x_L} = S_{{y}_{F}}$ and $S_{x_R}=S_{{z}_{G}}$ hold at the same time, where $F,G \

Figures (9)

  • Figure 1: (a) $\mathsf{CDAWG}(T)$ of string $T=\mathtt{AGAGCGAGCGCGC}\$$ for which $\mathsf{M}(T) = \{\varepsilon, \mathtt{G}, \mathtt{GC}, \mathtt{AG}, \mathtt{GAG}, \mathtt{GCG}, \mathtt{GCGC}, \mathtt{AGAGCG}, T\}$. The number of right-extensions of $\mathsf{CDAWG}(T)$ is the number $\mathsf{e}(T)$ of edges, which is 18 in this example. (b) The reversed DAG $\overline{\mathsf{CDAWG}(T)}$, where each edge is labeled by the initial character and the length of the edge's original label. (c) The derivation tree $\mathcal{T}(T)$ obtained by unfolding $\overline{\mathsf{CDAWG}(T)}$. (d) The grammar rules obtained from $\mathcal{T}(T)$. (e) The resulting grammar $\mathsf{G_{CDAWG}}(T)$ without redundant rules. This grammar size is 13.
  • Figure 2: The CDAWG and CDAWG-grammar for the strings $T$ and $T'$ in Example \ref{ex:size_diff}.
  • Figure 3: Illustration of $x_L$ in $T'$ for the case where $x_L$ contains $i$, with insertion and substitution.
  • Figure 4: Illustration for the strings $y$ and $q$ for a string $x \in V_2(T)$ and the relation of their nodes, where the dashed arc represents the suffix link. Any proper suffix $x'$ of $x$ that is longer than $q$ is not a maximal repeat of $T$.
  • Figure 5: Illustration for the proof of Lemma \ref{lem:x_inedge}. The case does not occur.
  • ...and 4 more figures
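
The quantities named in the Figure 1 caption, the set $\mathsf{M}(T)$ of maximal repeats and the number $\mathsf{e}(T)$ of right-extensions, can be computed by brute force on small inputs. The following is a minimal sketch, not the paper's construction: the function names are hypothetical, the toy strings `abab$` and `abcb$` are not taken from the paper, and the enumeration is cubic in the text length, so it is only suitable for tiny examples. Following the caption, the empty string and $T$ itself are included in $\mathsf{M}(T)$ by convention, and the text is assumed to end with a unique sentinel `$`.

```python
from collections import defaultdict


def maximal_repeats(text):
    """Return (M, right) for a text ending in a unique sentinel.

    M is the set of maximal repeats (plus "" and the text itself, by
    convention); right maps each substring to the set of characters
    that follow some occurrence of it.
    """
    n = len(text)
    left = defaultdict(set)   # substring -> characters preceding it
    right = defaultdict(set)  # substring -> characters following it
    for i in range(n + 1):
        for j in range(i, n + 1):
            w = text[i:j]
            # None marks "no preceding character" (w is a prefix).
            left[w].add(text[i - 1] if i > 0 else None)
            if j < n:
                right[w].add(text[j])
    # A repeat is maximal if it has at least two distinct left contexts
    # and at least two distinct right contexts.
    reps = {w for w in left if len(left[w]) >= 2 and len(right[w]) >= 2}
    reps |= {"", text}
    return reps, right


def num_right_extensions(text):
    """Count pairs (w, c) with w a maximal repeat and wc occurring in text."""
    reps, right = maximal_repeats(text)
    return sum(len(right[w]) for w in reps)


if __name__ == "__main__":
    reps, _ = maximal_repeats("abab$")
    print(sorted(reps, key=lambda w: (len(w), w)))  # ['', 'ab', 'abab$']
    print(num_right_extensions("abab$"))            # 5
    # One substitution (third character a -> c) changes the count:
    print(num_right_extensions("abcb$"))            # 6
```

On this toy pair the edge count changes from 5 to 6 after one substitution, which illustrates the kind of difference $\mathsf{e}(T')-\mathsf{e}(T)$ that the paper bounds in general.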

Theorems & Definitions (29)

  • Example 1
  • Definition 1
  • Definition 2
  • Lemma 1: Lemmas 2 and 3 of HamaiFI2025
  • Lemma 2: Lemmas 4 and 5 of HamaiFI2025
  • Lemma 3
  • Lemma 4
  • Lemma 5: Adapted from HamaiFI2025
  • Lemma 6
  • Lemma 7: Reformulation of Lemma 15 of HamaiFI2025
  • ...and 19 more