Finding Diverse Strings and Longest Common Subsequences in a Graph
Yuto Shida, Giulia Punzi, Yasuaki Kobayashi, Takeaki Uno, Hiroki Arimura
TL;DR
The paper defines and analyzes the Diverse LCSs problem under the Hamming distance, using two diversity measures, Sum and Min, to select a diverse set of $K$ LCSs. It develops a Sigma-DAG framework that succinctly represents sets of LCSs and maps out the complexity landscape: when $K$ is fixed, both the Max-Min and Max-Sum problems are solvable in polynomial time; when $K$ is unbounded, both become NP-hard, although Max-Sum admits a PTAS, and FPT algorithms exist when parameterized by $K$ and $r$. The positive results are proven for the Sigma-DAG representation, while the negative results hold even when the input is given explicitly as a set of strings, via reductions from classical NP-hard problems and W[1]-hardness results. The algorithms combine dynamic programming, color-coding, and properties of negative-type metrics to obtain both exact and approximate solutions, and the hardness results are transferred from Diverse String Set to Diverse LCSs. Overall, the work clarifies when diverse subsets of LCSs can be computed efficiently from structured string representations and when they cannot.
Abstract
In this paper, we study for the first time the Diverse Longest Common Subsequences (LCSs) problem under Hamming distance. Given a constant number of input strings, the problem asks to decide whether there exists a subset $\mathcal X$ of $K$ longest common subsequences whose diversity is no less than a specified threshold $\Delta$. We consider two types of diversity for a set $\mathcal X$ of strings of equal length: the Sum diversity and the Min diversity, defined as the sum and the minimum, respectively, of the pairwise Hamming distances between any two strings in $\mathcal X$. We analyze the computational complexity of the corresponding problems, called the Max-Sum and Max-Min Diverse LCSs, respectively, considering both approximation algorithms and parameterized complexity. Our results are summarized as follows. When $K$ is bounded, both problems are polynomial time solvable. In contrast, when $K$ is unbounded, both problems become NP-hard, while the Max-Sum Diverse LCSs problem admits a PTAS. Furthermore, we analyze the parameterized complexity of both problems with combinations of parameters $K$ and $r$, where $r$ is the length of the candidate strings to be selected. Importantly, all positive results above are proven in a more general setting, where the input is an edge-labeled directed acyclic graph (DAG) that succinctly represents a set of strings of the same length. Negative results are proven in the setting where the input is explicitly given as a set of strings. The latter results are complemented by an encoding of such a set as the set of longest common subsequences of a specific set of input strings.
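To make the two diversity measures concrete, the following minimal Python sketch computes the Sum and Min diversity of a set of equal-length strings and picks the best $K$-subset by exhaustive search. The function names (`hamming`, `sum_diversity`, `min_diversity`, `best_subset`) and the toy strings are illustrative assumptions, not taken from the paper, and the brute-force search is exponential in general; it is not one of the paper's algorithms.

```python
from itertools import combinations


def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def sum_diversity(strings):
    """Sum of the pairwise Hamming distances over all unordered pairs."""
    return sum(hamming(x, y) for x, y in combinations(strings, 2))


def min_diversity(strings):
    """Minimum pairwise Hamming distance (needs at least two strings)."""
    return min(hamming(x, y) for x, y in combinations(strings, 2))


def best_subset(candidates, k, diversity):
    """Hypothetical brute-force baseline: try every k-subset of the
    candidates and keep one that maximizes the given diversity measure."""
    return max(combinations(candidates, k), key=diversity)


# Toy candidate set (illustrative only, not derived from real LCSs).
candidates = ["abc", "abd", "xbd"]
print(sum_diversity(candidates))                    # 1 + 2 + 1 = 4
print(min_diversity(candidates))                    # 1
print(best_subset(candidates, 2, min_diversity))    # ('abc', 'xbd')
```

The decision version in the abstract then amounts to checking whether the best achievable diversity is at least the threshold $\Delta$.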
