Absent Subsequences in Words

Maria Kosche; Tore Koß; Florin Manea; Stefan Siemer

TL;DR

Absent subsequences of words are analyzed through two notions: $\mathrm{SAS}$ (shortest absent subsequences) and $\mathrm{MAS}$ (minimal absent subsequences). The analysis relies on the universality index $\iota(w)$, the largest $k$ such that all strings of length at most $k$ over $\mathrm{alph}(w)$ appear as subsequences of $w$, and on the arch factorisation of $w$. The authors develop combinatorial characterisations of both sets and use the $A_k$ and $B_k$ constructions to contrast exponential and polynomial growth in the number of absent subsequences, which motivates compact representations. They then present two such representations, built in linear or near-linear time: a compact SAS representation based on the arch-tree $\mathcal A_w$, which supports $O(1)$ sasRange queries after $O(n)$ preprocessing as well as construction of the lexicographically smallest SAS, and a compact MAS representation via the DAG $\mathcal D_w$, which supports efficient MAS testing, longest MAS computation, and MAS-extension queries accelerated by range minimum queries (RMQ). The resulting encodings of absent subsequences are query-efficient, with potential applications in verification and bioinformatics.
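To make the universality index concrete, here is a minimal Python sketch of a greedy arch factorisation. It assumes the standard greedy reading of the term (an arch closes as soon as every letter of $\mathrm{alph}(w)$ has been seen since the previous arch ended) and the known fact that the number of arches equals $\iota(w)$. The function names `arch_factorisation` and `universality_index` are illustrative, not taken from the paper.

```python
def arch_factorisation(w):
    """Greedily split w into arches plus a rest.

    An arch is a shortest block (from the current start position) that
    contains every letter of alph(w). The rest is the trailing part
    that misses at least one letter.
    """
    sigma = set(w)          # alph(w)
    arches = []
    seen = set()
    start = 0
    for i, c in enumerate(w):
        seen.add(c)
        if seen == sigma:   # all letters seen: close the current arch
            arches.append(w[start:i + 1])
            start = i + 1
            seen = set()
    return arches, w[start:]


def universality_index(w):
    """Number of arches, used here as the value of iota(w)."""
    return len(arch_factorisation(w)[0])


print(arch_factorisation("banana"))    # (['ban'], 'ana')
print(universality_index("abcabcab"))  # 2
```

For `"banana"` the index is 1: every single letter occurs, but `"bb"` does not occur as a subsequence, so the shortest absent subsequences have length 2.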

Abstract

An absent factor of a string $w$ is a string $u$ which does not occur as a contiguous substring (a.k.a. factor) inside $w$. We extend this well-studied notion and define absent subsequences: a string $u$ is an absent subsequence of a string $w$ if $u$ does not occur as a subsequence (a.k.a. scattered factor) inside $w$. Of particular interest to us are minimal absent subsequences, i.e., absent subsequences all of whose proper subsequences occur in $w$, and shortest absent subsequences, i.e., absent subsequences of minimal length. We show a series of combinatorial and algorithmic results regarding these two notions. For instance: we give combinatorial characterisations of the sets of minimal and, respectively, shortest absent subsequences in a word, as well as compact representations of these sets; we show how we can test efficiently if a string is a shortest or minimal absent subsequence in a word, and we give efficient algorithms computing the lexicographically smallest absent subsequence of each kind; also, we show how a data structure for answering shortest absent subsequence-queries for the factors of a given string can be efficiently computed.
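The definitions above can be checked directly on small inputs. The following Python sketch is a naive, definition-level illustration and not the efficient algorithms the abstract refers to. It assumes that shortest absent subsequences are taken over $\mathrm{alph}(w)$, and the helper names (`is_subsequence`, `is_mas`, `is_sas`, `shortest_absent_length`) are hypothetical. The MAS test only deletes one letter at a time, which suffices because every proper subsequence of $u$ is a subsequence of some one-letter deletion of $u$.

```python
from itertools import product


def is_subsequence(u, w):
    """True if u occurs in w as a (scattered) subsequence."""
    it = iter(w)
    return all(c in it for c in u)  # each 'in' consumes the iterator


def is_absent(u, w):
    return not is_subsequence(u, w)


def is_mas(u, w):
    """Minimal absent subsequence: absent, but every one-letter deletion occurs."""
    if not is_absent(u, w):
        return False
    return all(is_subsequence(u[:i] + u[i + 1:], w) for i in range(len(u)))


def shortest_absent_length(w):
    """Brute force (exponential): least k with an absent string of length k over alph(w)."""
    if not w:
        raise ValueError("w must be non-empty")
    sigma = sorted(set(w))
    k = 1
    while True:  # terminates: every string of length len(w) + 1 is absent
        if any(is_absent("".join(t), w) for t in product(sigma, repeat=k)):
            return k
        k += 1


def is_sas(u, w):
    """Shortest absent subsequence over alph(w) (assumed convention)."""
    return (set(u) <= set(w)
            and is_absent(u, w)
            and len(u) == shortest_absent_length(w))


w = "banana"
print(is_sas("bb", w), is_mas("bb", w))      # True True
print(is_sas("aaaa", w), is_mas("aaaa", w))  # False True
print(is_mas("bbb", w))                      # False
```

In `"banana"`, `"aaaa"` is minimal but not shortest: removing any letter gives `"aaa"`, which occurs, yet absent strings of length 2 such as `"bb"` exist. `"bbb"` is absent but not minimal, since its proper subsequence `"bb"` is already absent.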
