Computing Minimal Absent Words and Extended Bispecial Factors with CDAWG Space
Shunsuke Inenaga, Takuya Mieno, Hiroki Arimura, Mitsuru Funakoshi, Yuta Fujishige
TL;DR
This work addresses the computation of minimal absent words (MAWs) and related structures using space-efficient compact directed acyclic word graphs (CDAWGs). It introduces CDAWG-based grammars and an extended longest path tree to report MAWs, extended bispecial factors (EBFs), and minimal rare words (MRWs) in time linear in the output size while using $O(\mathsf{e}_{\min})$ space, where $\mathsf{e}_{\min}=\min\{\mathsf{el}(S),\mathsf{er}(S)\}$ is the smaller of the sizes of the CDAWGs for $S$ and for its reversal. The authors derive combinatorial bounds such as $|\mathsf{MAW}(S)| = O(\sigma \mathsf{e}_{\min})$ and $|\mathsf{EBF}(S)| \le \mathsf{er}(S)+\mathsf{el}(S)-|\mathsf{V}|+1$, and introduce length-bounded reporting with LA-free variants. These results give space-efficient reporting of MAWs, EBFs, and MRWs, with potential applications in bioinformatics and data compression, and a framework that links combinatorial properties of words to the structure of the CDAWG and to grammar compression.
Abstract
A string $w$ is said to be a minimal absent word (MAW) for a string $S$ if $w$ does not occur in $S$ and every proper substring of $w$ occurs in $S$. We focus on non-trivial MAWs, which are of length at least 2. Finding such non-trivial MAWs for a given string is motivated by applications in bioinformatics and data compression. Fujishige et al. [TCS 2023] proposed a data structure of size $\Theta(n)$ that can output the set $\mathsf{MAW}(S)$ of all MAWs for a given string $S$ of length $n$ in $O(n + |\mathsf{MAW}(S)|)$ time, based on the directed acyclic word graph (DAWG). In this paper, we present a more space-efficient data structure based on the compact DAWG (CDAWG), which can output $\mathsf{MAW}(S)$ in $O(|\mathsf{MAW}(S)|)$ time with $O(\mathsf{e}_{\min})$ space, where $\mathsf{e}_{\min}$ denotes the minimum of the sizes of the CDAWGs for $S$ and for its reversal $S^R$. For any string of length $n$, it holds that $\mathsf{e}_{\min} < 2n$, and for highly repetitive strings $\mathsf{e}_{\min}$ can be sublinear (up to logarithmic) in $n$. We also show that MAWs and their generalization, minimal rare words, are closely related to extended bispecial factors, via the CDAWG.
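To make the definition of a non-trivial MAW concrete, the following is a minimal brute-force sketch in Python. It is not the CDAWG-based method of the paper: it simply enumerates all substrings of $S$, so it needs quadratic space and is only suitable for tiny inputs. The function name `minimal_absent_words` is hypothetical and chosen for illustration. The sketch relies on the fact that a word $w$ of length at least 2 is a MAW exactly when $w$ is absent from $S$ while both its longest proper prefix and its longest proper suffix occur in $S$, since every other proper substring of $w$ lies inside one of those two.

```python
def minimal_absent_words(s: str) -> set[str]:
    """Return all non-trivial (length >= 2) minimal absent words of s.

    Brute-force illustration only: builds the full set of substrings of s.
    """
    n = len(s)
    # All non-empty substrings of s.
    substrings = {s[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    # A non-trivial MAW can only use characters that occur in s,
    # because each of its single characters is a proper substring.
    alphabet = set(s)

    maws = set()
    for prefix in substrings:          # candidate longest proper prefix w[:-1]
        for c in alphabet:
            w = prefix + c
            # w must be absent, and its longest proper suffix must occur.
            if w not in substrings and w[1:] in substrings:
                maws.add(w)
    return maws


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab")))
    # prints ['aaa', 'aaba', 'bab', 'bb']
```

For example, for $S = \mathtt{abaab}$ the word $\mathtt{bab}$ is a MAW because $\mathtt{ba}$ and $\mathtt{ab}$ both occur in $S$ while $\mathtt{bab}$ does not, whereas $\mathtt{abb}$ is absent but not minimal because its suffix $\mathtt{bb}$ is already absent.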
