Finding Diverse Strings and Longest Common Subsequences in a Graph

Yuto Shida, Giulia Punzi, Yasuaki Kobayashi, Takeaki Uno, Hiroki Arimura

TL;DR

The paper defines and analyzes Diverse LCSs under the Hamming distance, using Sum and Min diversity measures to select a diverse set of $K$ LCSs. It develops a $\Sigma$-DAG framework that succinctly represents LCS sets and maps out the complexity of both problems: when $K$ is fixed, Max-Min and Max-Sum are solvable in polynomial time; when $K$ is unbounded, both are NP-hard, although Max-Sum admits a PTAS. It also studies parameterized complexity in $K$ and $r$, the length of the candidate strings, giving FPT algorithms and W[1]-hardness results. The positive results are proven for the more general $\Sigma$-DAG input, while the negative results already hold for explicitly given string sets and are supported by reductions from classical NP-hard problems. The techniques combine dynamic programming, color-coding, and properties of negative-type metrics to obtain exact and approximate algorithms, together with a transfer of hardness from Diverse String Set to Diverse LCSs.

Abstract

In this paper, we study for the first time the Diverse Longest Common Subsequences (LCSs) problem under Hamming distance. Given a set of a constant number of input strings, the problem asks to decide if there exists some subset $\mathcal X$ of $K$ longest common subsequences whose diversity is no less than a specified threshold $\Delta$, where we consider two types of diversities of a set $\mathcal X$ of strings of equal length: the Sum diversity and the Min diversity, defined as the sum and the minimum of the pairwise Hamming distance between any two strings in $\mathcal X$, respectively. We analyze the computational complexity of the respective problems with Sum- and Min-diversity measures, called the Max-Sum and Max-Min Diverse LCSs, respectively, considering both approximation algorithms and parameterized complexity. Our results are summarized as follows. When $K$ is bounded, both problems are polynomial-time solvable. In contrast, when $K$ is unbounded, both problems become NP-hard, while the Max-Sum Diverse LCSs problem admits a PTAS. Furthermore, we analyze the parameterized complexity of both problems with combinations of parameters $K$ and $r$, where $r$ is the length of the candidate strings to be selected. Importantly, all positive results above are proven in a more general setting, where an input is an edge-labeled directed acyclic graph (DAG) that succinctly represents a set of strings of the same length. Negative results are proven in the setting where an input is explicitly given as a set of strings. The latter results are equipped with an encoding of such a set as the longest common subsequences of a specific input string set.
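The two diversity measures from the abstract can be illustrated with a minimal Python sketch. The function names `hamming`, `sum_diversity`, and `min_diversity` and the example set `X` are chosen here for illustration and do not come from the paper.

```python
from itertools import combinations

def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))

def sum_diversity(strings):
    """Sum of the pairwise Hamming distances over all unordered pairs."""
    return sum(hamming(x, y) for x, y in combinations(strings, 2))

def min_diversity(strings):
    """Minimum pairwise Hamming distance (requires at least two strings)."""
    return min(hamming(x, y) for x, y in combinations(strings, 2))

# Hypothetical set of K = 3 strings of equal length r = 4.
X = ["abab", "abba", "baab"]
print(sum_diversity(X), min_diversity(X))  # prints: 8 2
```

In the decision versions described above, one would compare either value against the threshold $\Delta$.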

Paper Structure

This paper contains 16 sections, 20 theorems, 3 equations, 4 figures, 2 tables, and 4 algorithms.

Key Result

Lemma 1

For any constant $m \geqslant 1$ and any set $\mathcal{S} = \{S_1, \dots, S_m\}\subseteq \Sigma^*$ of $m$ strings, there exists a $\Sigma$-DAG $G$ of polynomial size in $\ell := \mathrm{maxlen}(\mathcal{S})$ such that $L(G) = LCS(\mathcal{S})$, and $G$ can be computed in polynomial time in $\ell$.
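To make the statement $L(G) = LCS(\mathcal{S})$ concrete, the following sketch enumerates the language of a small edge-labeled DAG and compares it with a brute-force LCS set. The DAG is built by hand for this one example, and its encoding as a dictionary of `(label, next_vertex)` lists is an assumption of this sketch; it is not the polynomial-time construction from the lemma, and both helper functions take exponential time in general.

```python
from itertools import combinations

def language(dag, source, sink):
    """All strings spelled by source-to-sink paths of an edge-labeled DAG."""
    if source == sink:
        return {""}
    result = set()
    for label, nxt in dag.get(source, []):
        result |= {label + rest for rest in language(dag, nxt, sink)}
    return result

def is_subsequence(t, s):
    """True if t can be obtained from s by deleting characters."""
    it = iter(s)
    return all(c in it for c in t)

def brute_force_lcs(strings):
    """Set of all longest common subsequences, by exhaustive search."""
    base = min(strings, key=len)
    for length in range(len(base), 0, -1):
        found = {"".join(c) for c in combinations(base, length)
                 if all(is_subsequence("".join(c), s) for s in strings)}
        if found:
            return found
    return {""}  # only the empty string is common

# Hand-built DAG (hypothetical vertex names) for S = {"abab", "baba"}.
G = {
    "s":  [("a", "a1"), ("b", "b1")],
    "a1": [("b", "a2")],
    "a2": [("a", "t")],
    "b1": [("a", "b2")],
    "b2": [("b", "t")],
}
print(sorted(language(G, "s", "t")))              # prints: ['aba', 'bab']
print(sorted(brute_force_lcs(["abab", "baba"])))  # prints: ['aba', 'bab']
```

The point of the lemma is that such a graph can have size polynomial in $\ell$ even when the number of LCSs it represents is much larger.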

Figures (4)

  • Figure 1: Illustration of \ref{algo:k:const:dp} based on dynamic programming. In (a), a $\Sigma$-DAG $G_1$ represents the six LCSs in \ref{fig:example:lcs}. In (b), circles and arrows indicate the states of the algorithm, which are $K$-tuples of vertices of $G_1$, and the transitions between them, respectively. All states are associated with a set of $K\times K$ weight matrices, which are shown only for the sink $ttt$ in the figure.
  • Figure 2: Illustration of the proof for \ref{lem:const:trie}, where dashed lines indicate a correspondence $\varphi$.
  • Figure 3: An example of the reduction for the proof of \ref{thm:w1hard:div:strdag} in the case of $n = 5$, consisting of an instance $G$ of Clique, with a vertex set $V = \{1,\dots,5\}$ and an edge set $E \subseteq \mathcal{E} = \{12, 13, \dots, 45\}$ (left), and the associated instance $F = \{S_1, \dots, S_n\}$ of Diverse $r$-String Set, where $F$ contains $n=5$ $r$-strings with $r = |\mathcal{E}| = 10$ (right). Shadowed cells indicate the occurrences of symbol $0$.
  • Figure 4: Construction of the FPT-reduction from Max-Min Diverse String Set to Max-Min Diverse LCS in the proof of \ref{lem:fptreduce:strdag:to:lcs}, where $s = 4$. We show (a) the set $\mathcal{T}$ of $s$ $r$-strings and (b) a pair of input strings $S_1$ and $S_2$. Red and blue parallelograms, respectively, indicate allowed and prohibited matchings between the copies of blocks $T_3 = A_3 W_3 B_3$ in $S_1$ and $S_2$.
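
The state space described in the caption of Figure 1 ($K$-tuples of vertices, each carrying a set of $K\times K$ weight matrices) can be sketched as follows. This is a simplified illustration and not the paper's algorithm: it assumes every source-to-sink path has the same length, encodes the DAG as a dictionary of `(label, next_vertex)` lists, and enforces distinctness of the $K$ strings by requiring positive pairwise distances. The names `weight_matrices` and `best_diversities` are hypothetical.

```python
from itertools import combinations, product

def weight_matrices(dag, source, sink, K):
    """Set of K x K pairwise Hamming-distance matrices reachable at (sink,)*K."""
    zero = tuple((0,) * K for _ in range(K))
    frontier = {(source,) * K: {zero}}   # state -> set of matrices
    goal = (sink,) * K
    while goal not in frontier:
        nxt = {}
        for state, mats in frontier.items():
            # Advance all K components by one edge simultaneously.
            for edges in product(*(dag.get(v, []) for v in state)):
                labels = [label for label, _ in edges]
                target = tuple(head for _, head in edges)
                bucket = nxt.setdefault(target, set())
                for m in mats:
                    bucket.add(tuple(
                        tuple(m[i][j] + (labels[i] != labels[j])
                              for j in range(K))
                        for i in range(K)))
        if not nxt:
            raise ValueError("sink not reachable by equal-length paths")
        frontier = nxt
    return frontier[goal]

def best_diversities(dag, source, sink, K):
    """(max-min, max-sum) over K pairwise distinct strings, or None."""
    pairs = list(combinations(range(K), 2))
    mats = [m for m in weight_matrices(dag, source, sink, K)
            if all(m[i][j] > 0 for i, j in pairs)]  # distinct strings only
    if not mats:
        return None
    return (max(min(m[i][j] for i, j in pairs) for m in mats),
            max(sum(m[i][j] for i, j in pairs) for m in mats))

# Hypothetical DAG whose language is {aa, ab, ba, bb}.
G = {"s": [("a", "u"), ("b", "u")], "u": [("a", "t"), ("b", "t")]}
print(best_diversities(G, "s", "t", 2))  # prints: (2, 2)
print(best_diversities(G, "s", "t", 3))  # prints: (1, 4)
```

Each matrix entry is at most the string length $r$, so a state holds at most $(r+1)^{K^2}$ distinct matrices, which is why this kind of table stays polynomial in size when $K$ is a constant.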

Theorems & Definitions (22)

  • Lemma 1: $\Sigma$-DAG for LCSs
  • Remark 2
  • Definition 3
  • Lemma 4: recurrence for $\mathtt{Weights}$
  • Theorem 5: Polynomial time complexity of Max-Min Diverse String Set
  • Lemma 6: recurrence for $\mathtt{Weights}'$
  • Theorem 7: Polynomial time complexity of Max-Sum Diverse String Set
  • Lemma 8: Deza and Laurent \cite{deza1997geometry:book}
  • Theorem 9: Cevallos et al. \cite{cevallos2019improved}
  • Lemma 10
  • ...and 12 more