Absent Subsequences in Words
Maria Kosche, Tore Koß, Florin Manea, Stefan Siemer
TL;DR
Absent subsequences of words are analysed through two notions, shortest absent subsequences ($\mathrm{SAS}$) and minimal absent subsequences ($\mathrm{MAS}$), with a focus on the universality index $\iota(w)$ and the arch factorisation. Here $\iota(w)$ denotes the largest $k$ such that all strings of length at most $k$ over $\mathrm{alph}(w)$ appear as subsequences of $w$. The authors develop combinatorial characterisations and exhibit exponential versus polynomial growth phenomena via the $A_k$ and $B_k$ constructions, which motivates the need for compact representations. They then present two data-structure approaches running in linear or near-linear time: a compact $\mathrm{SAS}$ representation based on the arch-tree $\mathcal A_w$, which supports $O(1)$ sasRange queries after $O(n)$ preprocessing as well as the construction of the lexicographically smallest $\mathrm{SAS}$, and a compact $\mathrm{MAS}$ representation via the DAG $\mathcal D_w$, which supports efficient $\mathrm{MAS}$ testing, longest-$\mathrm{MAS}$ computation, and $\mathrm{MAS}$-extension queries accelerated by RMQ. These results yield query-efficient encodings of absent subsequences, with potential applications in verification and bioinformatics.
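To make the universality index concrete, the following is a minimal sketch, not the authors' implementation. It relies on the standard fact that $\iota(w)$ equals the number of arches in the arch factorisation of $w$, where each arch is the shortest factor that completes the alphabet $\mathrm{alph}(w)$. The function name `universality_index` is an illustrative choice, and the alphabet is assumed to be exactly the set of letters occurring in $w$.

```python
def universality_index(w: str) -> int:
    """Count the arches of w with a greedy left-to-right scan.

    Assumption: the alphabet is alph(w), i.e. the letters occurring in w.
    """
    sigma = set(w)
    if not sigma:
        return 0  # the empty word has no arches
    seen = set()
    arches = 0
    for letter in w:
        seen.add(letter)
        if len(seen) == len(sigma):
            # every letter of alph(w) has been seen: one arch is complete
            arches += 1
            seen.clear()
    return arches


if __name__ == "__main__":
    print(universality_index("abba"))
    print(universality_index("aab"))
```

Under this reading, every shortest absent subsequence of $w$ has length $\iota(w) + 1$, so the scan above also yields the length of an $\mathrm{SAS}$ in $O(n)$ time.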
Abstract
An absent factor of a string $w$ is a string $u$ which does not occur as a contiguous substring (a.k.a. factor) inside $w$. We extend this well-studied notion and define absent subsequences: a string $u$ is an absent subsequence of a string $w$ if $u$ does not occur as a subsequence (a.k.a. scattered factor) inside $w$. Of particular interest to us are minimal absent subsequences, i.e., absent subsequences all of whose proper subsequences occur in $w$, and shortest absent subsequences, i.e., absent subsequences of minimal length. We show a series of combinatorial and algorithmic results regarding these two notions. For instance: we give combinatorial characterisations of the sets of minimal and, respectively, shortest absent subsequences in a word, as well as compact representations of these sets; we show how we can test efficiently if a string is a shortest or minimal absent subsequence in a word, and we give efficient algorithms computing the lexicographically smallest absent subsequence of each kind; also, we show how a data structure for answering shortest absent subsequence-queries for the factors of a given string can be efficiently computed.
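The two notions can be illustrated with a small brute-force sketch that follows the definitions directly. It is not the efficient method the abstract refers to: the enumeration of shortest absent subsequences is exponential in general. The helper names (`is_subsequence`, `is_mas`, `shortest_absent_subsequences`) are hypothetical, and candidate strings are assumed to be over $\mathrm{alph}(w)$.

```python
from itertools import product


def is_subsequence(u: str, w: str) -> bool:
    """True if u occurs in w as a (scattered) subsequence."""
    remaining = iter(w)
    # `c in remaining` consumes the iterator up to the first match,
    # so the letters of u are matched greedily from left to right
    return all(c in remaining for c in u)


def is_mas(u: str, w: str) -> bool:
    """True if u is a minimal absent subsequence of w.

    Deleting a single letter suffices for the check: if every such
    deletion occurs in w, then so does every shorter subsequence of u.
    """
    if is_subsequence(u, w):
        return False
    return all(is_subsequence(u[:i] + u[i + 1:], w) for i in range(len(u)))


def shortest_absent_subsequences(w: str) -> list:
    """All shortest absent subsequences of w over alph(w), in
    lexicographic order (brute force, for illustration only)."""
    sigma = sorted(set(w))
    if not sigma:
        return []
    length = 1
    while True:  # terminates: every string of length len(w) + 1 is absent
        absent = ["".join(t) for t in product(sigma, repeat=length)
                  if not is_subsequence("".join(t), w)]
        if absent:
            return absent
        length += 1


if __name__ == "__main__":
    print(shortest_absent_subsequences("aab"))  # first entry is the lexicographically smallest
    print(is_mas("aaa", "aab"))
```

For $w = \texttt{aab}$, the string $\texttt{aaa}$ is a minimal absent subsequence (it is absent, while $\texttt{aa}$ occurs), but it is not a shortest one, since $\texttt{ba}$ is absent and has length 2.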
