Subsequence Matching and Analysis Problems for Formal Languages

Szilárd Zsolt Fazekas, Tore Koß, Florin Manea, Robert Mercaş, Timo Specht

TL;DR

This work studies subsequence-matching and analysis problems for languages given by grammars or automata, extending classical subsequence questions from strings to formal languages. It develops a general decidability framework with two conditions (H1 and H2) that yield decidability for CFLs, proves undecidability for CSLs, and provides efficient CFL algorithms for five subsequence-related problems. The results include polynomial-time algorithms for some problems, comparable to the regular case, an FPT algorithm parameterized by the alphabet size for exist_universal_largerthan_k, and a polynomial-time method to decide universal-for-all_m and to compute $\iota_\forall$. The paper also introduces deterministic finite automata with translucent letters (TFAs) as an intermediate class, with initial decidability and undecidability results. Finally, it discusses the boundary between decidable and undecidable cases and outlines future work on intermediate language classes and downward-closure computations for TFAs, with the aim of further mapping the Chomsky hierarchy with respect to subsequence analysis.

Abstract

In this paper, we study a series of algorithmic problems related to the subsequences occurring in the strings of a given language, under the assumption that this language is succinctly represented by a grammar generating it, or an automaton accepting it. In particular, we focus on the following problems: Given a string $w$ and a language $L$, does there exist a word of $L$ which has $w$ as a subsequence? Do all words of $L$ have $w$ as a subsequence? Given an integer $k$ alongside $L$, does there exist a word of $L$ which has all strings of length $k$, over the alphabet of $L$, as subsequences? Do all words of $L$ have all strings of length $k$ as subsequences? For the last two problems, efficient algorithms were already presented in [Adamson et al., ISAAC 2023] for the case when $L$ is a regular language, and efficient solutions can be easily obtained for the first two problems. We extend that work as follows: we give sufficient conditions on the class of input-languages, under which these problems are decidable; we provide efficient algorithms for all these problems in the case when the input language is context-free; we show that all problems are undecidable for context-sensitive languages. Finally, we provide a series of initial results related to a class of languages that strictly includes the regular languages and is strictly included in the class of context-sensitive languages, but is incomparable to the class of context-free languages; these results deviate significantly from those reported for language-classes from the Chomsky hierarchy.
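For a single string, the two notions behind these problems can be sketched directly. The following is a minimal illustration rather than the paper's algorithms, which work on whole languages. The function names `is_subsequence` and `universality_index` are hypothetical. The second function relies on the standard greedy arch factorization: a string contains every length-$k$ string over the alphabet as a subsequence exactly when it can be cut greedily into at least $k$ blocks that each contain every letter.

```python
def is_subsequence(w, s):
    """Return True if w can be obtained from s by deleting letters."""
    it = iter(s)
    # 'ch in it' advances the iterator, so matches are found left to right.
    return all(ch in it for ch in w)


def universality_index(s, alphabet):
    """Largest k such that every string of length k over `alphabet`
    is a subsequence of s, computed by greedy arch factorization."""
    alphabet = set(alphabet)
    if not alphabet:
        raise ValueError("alphabet must be non-empty")
    seen = set()
    arches = 0
    for ch in s:
        if ch in alphabet:
            seen.add(ch)
        if len(seen) == len(alphabet):
            # All letters seen since the last cut: one arch is complete.
            arches += 1
            seen.clear()
    return arches
```

With these helpers, a string `s` has all strings of length `k` as subsequences exactly when `universality_index(s, alphabet) >= k`. The language-level problems then ask whether some word, or every word, of $L$ passes such a test.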

Paper Structure

This paper contains 7 sections, 17 theorems, 6 equations, 3 figures.

Key Result

Lemma 1

Given a string $w\in \Sigma^*$, with $|w|=n$ and $|\Sigma|=\sigma$, we can construct in time ${\mathcal{O}}(n\sigma )$ a minimal DFA, with $n+1$ states, accepting the set of strings which have $w$ as a subsequence.
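The construction in Lemma 1 can be sketched as follows, assuming the natural automaton in which state $i$ records that the first $i$ letters of $w$ have already been matched. The names `subsequence_dfa` and `accepts` are hypothetical, and the sketch assumes every input letter belongs to the given alphabet. Building the transition table takes $n+1$ rows of $\sigma$ entries each, in line with the ${\mathcal{O}}(n\sigma)$ bound.

```python
def subsequence_dfa(w, alphabet):
    """Build a DFA with len(w) + 1 states accepting exactly the strings
    over `alphabet` that have w as a subsequence.

    Returns (delta, start, accepting), where delta[i][a] is the state
    reached from state i on letter a.
    """
    n = len(w)
    delta = []
    for i in range(n):
        # Advance only on the next unmatched letter of w; otherwise stay.
        delta.append({a: (i + 1 if a == w[i] else i) for a in alphabet})
    # State n is accepting and absorbing: w has been fully matched.
    delta.append({a: n for a in alphabet})
    return delta, 0, n


def accepts(dfa, s):
    """Run the DFA on s (all letters of s must be in the alphabet)."""
    delta, state, accepting = dfa
    for ch in s:
        state = delta[state][ch]
    return state == accepting
```

For example, `accepts(subsequence_dfa("ab", "abc"), "cacb")` holds because `a` and then `b` occur in order, while the same automaton rejects `"ba"`.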

Figures (3)

  • Figure 1: TFA that accepts the language $w\shuffle h(w)$, where $w\in \{a,b\}^*$ and $h$ is a morphism of the form $h(a)=c$, $h(b)=d$.
  • Figure 2: The TFA accepting $i_1\cdots i_\ell \shuffle pr_1(u_{i_1}\cdots u_{i_\ell}) \shuffle pr_2(v_{i_1}\cdots v_{i_\ell}) \shuffle pr_3(w) \shuffle pr_4(w)\shuffle \#^{m}$, where $i_1,\dots,i_\ell \in [k]$, $w,u_j,v_j\in \Sigma^*$, $a,b,\dots \in \Sigma$, and $x\in \Sigma_1^*$ represents all strings over the alphabet with subscript $1$ such that $x\neq pr_1(u_1)$, and $y\in \Sigma_2^*$ represents all strings over the alphabet with subscript $2$ such that $y\neq pr_2(v_1)$.
  • Figure 3: A simple graph (without a Hamiltonian cycle) and part of the corresponding TFA as constructed in the proof of Theorem \ref{Thm:Prob1-dfawtl-NPhardness}. Gadgets $A_1$ and $A_2$ are marked with rectangles.

Theorems & Definitions (42)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • ...and 32 more