Absent Subsequences in Words

Maria Kosche, Tore Koß, Florin Manea, Stefan Siemer

TL;DR

Absent subsequences of words are analysed through two notions, shortest absent subsequences ($\mathrm{SAS}$) and minimal absent subsequences ($\mathrm{MAS}$), with a focus on the universality index $\iota(w)$ and the arch factorisation. Here $\iota(w)$ denotes the largest $k$ such that all strings of length at most $k$ over $\mathrm{alph}(w)$ appear as subsequences of $w$. The authors develop combinatorial characterisations and, via the $A_k$ and $B_k$ constructions, demonstrate exponential vs. polynomial growth phenomena, which motivate the need for compact representations. They then present two data-structure approaches running in linear or near-linear time: a compact SAS representation based on the arch-tree $\mathcal A_w$, which supports $O(1)$ sasRange queries after $O(n)$ preprocessing as well as the construction of the lexicographically smallest SAS, and a compact MAS representation via the DAG $\mathcal D_w$, which supports efficient MAS testing, longest-MAS computation, and MAS-extension queries accelerated by range-minimum queries (RMQ). These results yield query-efficient encodings of absent subsequences, with potential applications in verification and bioinformatics.
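To make the arch factorisation and the universality index concrete, here is a minimal Python sketch. It assumes the standard greedy definition: each arch is the shortest factor, starting where the previous arch ended, that contains every letter of $\mathrm{alph}(w)$, and $\iota(w)$ is the number of arches. The function names are illustrative and do not come from the paper. The last function builds one shortest absent subsequence from the last letter of each arch plus a letter missing from the remainder; it does not claim to return the lexicographically smallest one, and it is not the arch-tree data structure mentioned above.

```python
def arch_factorisation(w):
    """Greedy arch factorisation of w over its own alphabet alph(w).

    Returns (arches, rest): every arch contains all letters of alph(w),
    and rest is the trailing factor that misses at least one letter.
    """
    sigma = set(w)
    arches, seen, start = [], set(), 0
    for i, ch in enumerate(w):
        seen.add(ch)
        if seen == sigma:  # the current arch is complete
            arches.append(w[start:i + 1])
            start, seen = i + 1, set()
    return arches, w[start:]


def universality_index(w):
    """iota(w): the number of arches in the arch factorisation of w."""
    return len(arch_factorisation(w)[0])


def one_shortest_absent_subsequence(w):
    """One absent subsequence of length iota(w) + 1 (w must be non-empty)."""
    if not w:
        raise ValueError("w must be non-empty")
    arches, rest = arch_factorisation(w)
    # Some letter is always missing from the rest, otherwise it would
    # contain another complete arch.
    missing = min(set(w) - set(rest))
    return "".join(arch[-1] for arch in arches) + missing


print(arch_factorisation("abacbcab"))               # (['abac', 'bca'], 'b')
print(universality_index("abacbcab"))               # 2
print(one_shortest_absent_subsequence("abacbcab"))  # caa
```

For `"abacbcab"` the arches are `abac` and `bca`, so $\iota(w)=2$, and `caa` is absent because no `a` follows the second arch.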

Abstract

An absent factor of a string $w$ is a string $u$ which does not occur as a contiguous substring (a.k.a. factor) inside $w$. We extend this well-studied notion and define absent subsequences: a string $u$ is an absent subsequence of a string $w$ if $u$ does not occur as a subsequence (a.k.a. scattered factor) inside $w$. Of particular interest to us are minimal absent subsequences, i.e., absent subsequences whose every proper subsequence is not absent, and shortest absent subsequences, i.e., absent subsequences of minimal length. We show a series of combinatorial and algorithmic results regarding these two notions. For instance: we give combinatorial characterisations of the sets of minimal and, respectively, shortest absent subsequences in a word, as well as compact representations of these sets; we show how we can test efficiently if a string is a shortest or minimal absent subsequence in a word, and we give efficient algorithms computing the lexicographically smallest absent subsequence of each kind; also, we show how a data structure for answering shortest absent subsequence-queries for the factors of a given string can be efficiently computed.
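The definitions in the abstract can be checked directly on small inputs. The sketch below is a brute-force illustration written from the definitions alone, not the efficient tests developed in the paper; all function names are hypothetical. It relies on one fact: subsequences of subsequences are again subsequences, so it suffices to check the strings obtained by deleting a single letter (for MAS) and all strings of length $|u|-1$ (for SAS). The SAS check is exponential in $|u|$ and only meant for toy examples.

```python
from itertools import product


def is_subsequence(u, w):
    """True if u occurs in w as a (scattered) subsequence."""
    it = iter(w)
    return all(ch in it for ch in u)  # each 'in' consumes the iterator


def is_absent(u, w):
    """True if u is an absent subsequence of w."""
    return not is_subsequence(u, w)


def is_mas(u, w):
    """Minimal absent subsequence: u is absent, every proper subsequence occurs."""
    if is_subsequence(u, w):
        return False
    # Checking all single-letter deletions covers every proper subsequence.
    return all(is_subsequence(u[:i] + u[i + 1:], w) for i in range(len(u)))


def is_sas(u, w, alphabet):
    """Shortest absent subsequence: u is absent and no shorter string is."""
    if is_subsequence(u, w):
        return False
    shorter = ("".join(t) for t in product(alphabet, repeat=len(u) - 1))
    return all(is_subsequence(s, w) for s in shorter)


w = "abacbcab"
print(is_mas("caa", w), is_sas("caa", w, "abc"))    # True True
print(is_mas("aaaa", w), is_sas("aaaa", w, "abc"))  # True False
```

The second line of output shows that the two notions differ: `aaaa` is minimal (removing any letter gives `aaa`, which occurs in `w`) but not shortest, since the absent string `caa` is shorter.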

Paper Structure

This paper contains 6 sections, 25 theorems, 1 figure, 1 algorithm.

Key Result

Theorem 3.1

Let $v,w\in \Sigma^\ast$ with $|v|=m+1$ and $|w|=n$. Then $v$ is an $\mathrm{MAS}$ of $w$ if and only if there are positions $0=i_0<i_1<\ldots <i_m <i_{m+1}= n+1$ such that all of the following conditions are satisfied.

Figures (1)

  • Figure 1: Illustration of positions and intervals inside word $w$

Theorems & Definitions (40)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Definition 2.4
  • Theorem 3.1
  • Remark 3.2
  • Example 3.3
  • Theorem 3.4
  • Example 3.5
  • Proposition 3.6
  • ...and 30 more