The Complexity of Maximal Common Subsequence Enumeration

Giovanni Buzzega, Alessio Conte, Yasuaki Kobayashi, Kazuhiro Kurita, Giulia Punzi

TL;DR

This paper studies the computational complexity of enumerating maximal common subsequences (MCS) across multiple strings, a central instance of maximal frequent subsequence mining and a generalization of the classic longest common subsequence problem. It establishes strong negative results: no output-polynomial-time enumeration algorithm exists for MCS Enumeration unless $P = NP$, related hardness holds for MCS Indexing and MCS Assessment, and counting MCSs is $\#P$-complete on binary alphabets. On the positive side, it identifies tractable parameterized regimes: when the shortest string length $l$ is small, or when the number of strings $k$ is fixed, the problem admits polynomial or quasi-polynomial time algorithms and, in certain cases, constant-delay enumeration. Together, these results map the hardness landscape of maximal subsequence mining and point to restricted settings where enumeration, counting, and indexing remain feasible.

Abstract

Frequent pattern mining is widely used to find ``important'' or ``interesting'' patterns in data. While it is not easy to mathematically define such patterns, maximal frequent patterns are promising candidates, as frequency is a natural indicator of relevance and maximality helps to summarize the output. As such, their mining has been studied on various data types, including itemsets, graphs, and strings. The complexity of mining maximal frequent itemsets and subtrees has been thoroughly investigated (e.g., [Boros et al., 2003], [Uno et al., 2004]) in the literature. On the other hand, while the idea of mining frequent subsequences in sequential data was already introduced in the seminal paper [Agrawal et al., 1995], the complexity of the problem is still open. In this paper, we investigate the complexity of the maximal common subsequence enumeration problem, which is both an important special case of maximal frequent subsequence mining and a generalization of the classic longest common subsequence (LCS) problem. We show the hardness of enumerating maximal common subsequences between multiple strings, ruling out the possibility of an \emph{output-polynomial time} enumeration algorithm under $P \neq NP$, that is, an algorithm that runs in time ${\rm poly}(|\mathcal I| + N)$, where $|\mathcal I|$ and $N$ are the size of the input and number of output solutions, respectively. To circumvent this intractability, we also investigate the parameterized complexity of the problem, and show several results when the alphabet size, the number of strings, and the length of a string are taken into account as parameters.
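To make the definitions concrete, the following is a minimal brute-force sketch, not the paper's algorithm. It enumerates every common subsequence by trying all subsets of positions of the shortest string, then keeps those that are not a proper subsequence of another common subsequence. The function names are illustrative assumptions, and the running time is exponential in the length of the shortest string, so it is only suitable for tiny inputs.

```python
from itertools import combinations


def is_subsequence(p, s):
    """Return True if p can be obtained from s by deleting characters."""
    it = iter(s)
    # `ch in it` advances the iterator, so matches must occur in order.
    return all(ch in it for ch in p)


def common_subsequences(strings):
    """All distinct common subsequences, by brute force over the shortest string."""
    shortest = min(strings, key=len)
    found = set()
    for r in range(len(shortest) + 1):
        for idx in combinations(range(len(shortest)), r):
            cand = "".join(shortest[i] for i in idx)
            if all(is_subsequence(cand, s) for s in strings):
                found.add(cand)
    return found


def maximal_common_subsequences(strings):
    """Common subsequences that are not a proper subsequence of another one."""
    common = common_subsequences(strings)
    return sorted(
        c for c in common
        if not any(c != d and is_subsequence(c, d) for d in common)
    )


if __name__ == "__main__":
    # "ab" is the unique LCS, but "c" is also maximal: it cannot be
    # extended to a longer common subsequence of "cab" and "abc".
    print(maximal_common_subsequences(["cab", "abc"]))
```

For the inputs `cab` and `abc`, the longest common subsequence is `ab`, while `c` is a second, shorter MCS. This illustrates why MCS enumeration generalizes LCS: every LCS is an MCS, but not conversely.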

Paper Structure

This paper contains 12 sections, 17 theorems, 1 figure, 1 table, 1 algorithm.

Key Result

Theorem 1

Another MCS is NP-complete, even for instances where $|\mathcal{Z}| = O(n)$, where $n$ is the maximum length of an input string.

Figures (1)

  • Figure 1: An example of our reduction. Let $E_1 = \{1, 2\}$, $E_2 = \{1,3,4\}$, and $E_3 = \{3,4,5\}$. The maximal independent sets in $\mathcal{H}$ are $\{1,3,5\}$, $\{1, 4, 5\}$, $\{2, 3, 4\}$, $\{2,4,5\}$ and $\{2, 3, 5\}$. The MCSs in $S_0$, $S_1$, $S_2$, and $S_3$ are $01010101$, $01001001$, $01000101$, $00101010$, $00100101$ and $00101001$.
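The maximal independent sets listed in the caption can be checked with a short brute-force sketch. It assumes the hypergraph $\mathcal{H}$ has vertex set $\{1, \dots, 5\}$ and that a set is independent when it contains no hyperedge $E_i$ entirely; the function name is hypothetical, and the strings $S_0, \dots, S_3$ are not reproduced here because the caption does not give them.

```python
from itertools import combinations


def maximal_independent_sets(vertices, hyperedges):
    """Brute-force the maximal independent sets of a small hypergraph.

    A vertex set is independent if it fully contains no hyperedge.
    """
    vertices = sorted(vertices)
    edges = [frozenset(e) for e in hyperedges]

    def independent(s):
        return not any(e <= s for e in edges)

    all_independent = [
        frozenset(c)
        for r in range(len(vertices) + 1)
        for c in combinations(vertices, r)
        if independent(frozenset(c))
    ]
    # Independence is closed under taking subsets, so a set is maximal
    # exactly when adding any single missing vertex breaks independence.
    maximal = [
        s for s in all_independent
        if all(not independent(s | {v}) for v in vertices if v not in s)
    ]
    return sorted(sorted(s) for s in maximal)


if __name__ == "__main__":
    # Hyperedges E_1, E_2, E_3 from the Figure 1 caption.
    H = [{1, 2}, {1, 3, 4}, {3, 4, 5}]
    for s in maximal_independent_sets(range(1, 6), H):
        print(s)
```

Under these assumptions the sketch returns the same five sets as the caption: $\{1,3,5\}$, $\{1,4,5\}$, $\{2,3,4\}$, $\{2,3,5\}$ and $\{2,4,5\}$.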

Theorems & Definitions (17)

  • Theorem 1
  • Corollary 1
  • Corollary 2
  • Corollary 3
  • Theorem 2
  • Corollary 4
  • Corollary 5
  • Corollary 6
  • Theorem 3
  • Theorem 4
  • ...and 7 more