Computing Minimal Absent Words and Extended Bispecial Factors with CDAWG Space

Shunsuke Inenaga, Takuya Mieno, Hiroki Arimura, Mitsuru Funakoshi, Yuta Fujishige

TL;DR

This work tackles the computation of minimal absent words (MAWs) and related structures using space-efficient compact DAWGs (CDAWGs). It introduces CDAWG-based grammars and an extended longest path tree to report MAWs, extended bispecial factors (EBF), and minimal rare words (MRW) in time linear in the output size while using $O(\mathsf{e}_{\min})$ space, where $\mathsf{e}_{\min}=\min\{\mathsf{el}(S),\mathsf{er}(S)\}$. The authors derive tight combinatorial bounds such as $|\mathsf{MAW}(S)| = O(\sigma \mathsf{e}_{\min})$ and $|\mathsf{EBF}(S)| \le \mathsf{er}(S)+\mathsf{el}(S)-|\mathsf{V}|+1$, and introduce length-bounded reporting with LA-free variants. These results yield practical, space-efficient reporting for MAWs, EBFs, and MRWs, with potential impact in bioinformatics and data compression and a framework that links word-structure properties to CDAWG geometry and grammar compression.

Abstract

A string $w$ is said to be a minimal absent word (MAW) for a string $S$ if $w$ does not occur in $S$ and every proper substring of $w$ occurs in $S$. We focus on non-trivial MAWs, which are of length at least 2. Finding such non-trivial MAWs for a given string is motivated by applications in bioinformatics and data compression. Fujishige et al. [TCS 2023] proposed a data structure of size $\Theta(n)$ that can output the set $\mathsf{MAW}(S)$ of all MAWs for a given string $S$ of length $n$ in $O(n + |\mathsf{MAW}(S)|)$ time, based on the directed acyclic word graph (DAWG). In this paper, we present a more space-efficient data structure based on the compact DAWG (CDAWG), which can output $\mathsf{MAW}(S)$ in $O(|\mathsf{MAW}(S)|)$ time with $O(\mathsf{e}_{\min})$ space, where $\mathsf{e}_{\min}$ denotes the minimum of the sizes of the CDAWGs for $S$ and for its reversal $S^R$. For any string of length $n$, it holds that $\mathsf{e}_{\min} < 2n$, and for highly repetitive strings $\mathsf{e}_{\min}$ can be sublinear (up to logarithmic) in $n$. We also show that MAWs and their generalization, minimal rare words, have close relationships with extended bispecial factors, via the CDAWG.
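
To make the definition concrete, here is a minimal brute-force sketch in Python that enumerates the non-trivial MAWs of a short string. It is not the CDAWG-based method of the paper and runs in polynomial rather than output-linear time. The function names `substrings` and `minimal_absent_words` are illustrative, not taken from the paper. The sketch relies on the observation that a word $aub$ (with $a, b$ single characters) is a MAW exactly when $aub$ is absent while $au$ and $ub$ both occur, since every proper substring of $aub$ is a substring of $au$ or of $ub$.

```python
from itertools import product


def substrings(s: str) -> set[str]:
    """Return every substring of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def minimal_absent_words(s: str) -> set[str]:
    """Brute-force enumeration of the non-trivial MAWs (length >= 2) of s.

    A word a+u+b (a, b single characters) is reported when a+u+b is absent
    from s while both a+u and u+b occur in s.
    """
    subs = substrings(s)
    alphabet = sorted(set(s))  # a and b must occur in s, so this suffices
    maws = set()
    for u in subs:
        for a, b in product(alphabet, repeat=2):
            if a + u in subs and u + b in subs and a + u + b not in subs:
                maws.add(a + u + b)
    return maws


if __name__ == "__main__":
    # Running example string from the figures; "$" is treated as an
    # ordinary character here (an assumption of this sketch).
    S = "ababcbababcbc$"
    print(sorted(minimal_absent_words(S), key=lambda w: (len(w), w)))
```

On the running example $S = \mathtt{ababcbababcbc\$}$, the returned set contains $\mathtt{cbcb}$, the MAW discussed in the caption of Figure 3.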

Paper Structure (15 sections, 13 theorems, 4 figures)

This paper contains 15 sections, 13 theorems, 4 figures.

Key Result

Lemma 1

For any MRW $aub$ for string $S$ with $a,b \in \Sigma$ and $u \in \Sigma^*$, $u$ is a maximal repeat in $S$.
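
The lemma can be sanity-checked on small inputs. The sketch below is a brute-force check, not the paper's method, and it is restricted to MAWs, which the abstract describes as a special case of MRWs (the full MRW definition is not given in this summary). It assumes the following definition of a maximal repeat: $u$ occurs at least twice in $S$, and extending $u$ by any single character on the left or on the right strictly decreases its number of occurrences. All function names are illustrative.

```python
from itertools import product


def occ(s: str, w: str) -> int:
    """Number of (possibly overlapping) occurrences of w in s."""
    return sum(1 for i in range(len(s) - len(w) + 1) if s.startswith(w, i))


def is_maximal_repeat(s: str, u: str) -> bool:
    """Assumed definition: u occurs >= 2 times and every one-character
    extension (left or right) occurs strictly fewer times than u."""
    n = occ(s, u)
    return n >= 2 and all(
        occ(s, c + u) < n and occ(s, u + c) < n for c in set(s)
    )


def nontrivial_maws(s: str) -> set[str]:
    """Brute-force MAWs of length >= 2: a+u+b absent, a+u and u+b present."""
    subs = {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}
    alphabet = sorted(set(s))
    return {
        a + u + b
        for u in subs
        for a, b in product(alphabet, repeat=2)
        if a + u in subs and u + b in subs and a + u + b not in subs
    }


def lemma_violations(s: str) -> list[str]:
    """MAWs a+u+b of s whose middle part u is NOT a maximal repeat."""
    return sorted(w for w in nontrivial_maws(s) if not is_maximal_repeat(s, w[1:-1]))


if __name__ == "__main__":
    print(lemma_violations("ababcbababcbc$"))
```

Under the assumed definition, the list of violations is empty for every input string, which is consistent with the lemma in the MAW case.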

Figures (4)

  • Figure 1: $\mathsf{CDAWG}(S)$ for string $S = \mathtt{ababcbababcbc\$}$.
  • Figure 2: $\mathsf{CDAWG}(S)$ (left) and the CDAWG-grammar $\mathcal{G}_{\mathsf{CDAWG}}$ (right) for the running example from Fig. 1. The dashed arcs represent the suffix links of the nodes of $\mathsf{CDAWG}(S)$. To obtain $\mathsf{str}(X_3) = \mathtt{ababcb}$ for the CDAWG node $X_3$, we first decompress the non-terminal $X_3$ and obtain $\mathtt{ababc}$. We move to the node $X_2$ by following the suffix link of $X_3$. We then decompress the non-terminal $X_2$ and obtain the remaining $\mathtt{b}$.
  • Figure 3: $\mathsf{LPT}^+(S)$ for the running example from Figs. 1 and 2. The double-lined arcs represent primary edges, and the single-lined arcs represent secondary edges. For edge $(\hat{v}, \hat{u})$ with string label $\mathtt{c\$}$, consider the path $\langle v, u_1 \rangle$ that spells out $\mathtt{c\$}$ and is obtained by the fast link. The CDAWG node that corresponds to $u_2 = \mathtt{bc}$ has a virtual soft Weiner link with label $\mathtt{c}$ pointing to the CDAWG node that corresponds to $\hat{u}$. Node $u_2$ has an out-edge with $\mathtt{b}$. Therefore, $\mathtt{c}u_2\mathtt{b} = \mathtt{cbcb}$ is a MAW for the string $S = \mathtt{ababcbababcbc\$}$.
  • Figure 4: Cases (A), (B), (C), and (D) for computing MAWs from a given edge $(\hat{v}, \hat{u})$ on $\mathsf{LPT}^+(S)$. The bold arc represents the fast link from edge $(\hat{v}, \hat{u})$ to path $\langle v, u \rangle$, where $u = u_1$. The dashed arcs represent suffix links. The dotted arcs in Cases (C) and (D) are additional pointers for $O(1)$-time access from $\hat{u}$ to $\tilde{u}$.

Theorems & Definitions (20)

  • Lemma 1: BelazzouguiC15 and Theorem 1 of PinhoFGR09
  • Lemma 2: Lemma 9 of Inenaga_LCDAWG_2024
  • Lemma 3
  • proof
  • Theorem 1
  • Lemma 4
  • proof
  • Theorem 2
  • proof
  • Lemma 5
  • ...and 10 more