Subsequence Matching and LCS with Segment Number Constraints
Yuki Yonemoto, Takuya Mieno, Shunsuke Inenaga, Ryo Yoshinaka, Ayumi Shinohara
TL;DR
This work investigates two segment-constrained subsequence problems, SegE and SegLCS, introducing the formalism of $f$-segmentations and embeddings. It delivers a near-complete complexity picture: a conditional lower bound under SETH that rules out $O((mn)^{1-\epsilon})$-time algorithms for SegE, a matching $O(mn)$-time algorithm for SegE, and an $O(f|T_2|(|T_1|-\ell+1))$-time algorithm for SegLCS that is fast when the optimal length $\ell$ is large. The SegE algorithms hinge on an affine-gap DP formulation and efficient two-pointer/KMP-based preprocessing for special cases, while the SegLCS algorithm fuses a Banerjee et al. DP with Nakatsu et al.'s diagonal LCS approach to exploit long solutions. Together, these results map the computational landscape of segmental subsequences and guide future work on index structures, alphabet reductions, and related segmentation-generalized LCS problems.
Abstract
The longest common subsequence (LCS) is a fundamental problem in string processing with numerous algorithmic studies, extensions, and applications. A sequence $u_1, \ldots, u_f$ of $f$ strings is said to be an ($f$-)segmentation of a string $P$ if $P = u_1 \cdots u_f$. Li et al. [BIBM 2022] proposed a new variant of the LCS problem for given strings $T_1, T_2$ and an integer $f$, which we hereby call the segmental LCS problem (SegLCS), of finding (the length of) a longest string $P$ that has an $f$-segmentation which can be embedded into both $T_1$ and $T_2$. Li et al. [IJTCS-FAW 2024] gave a dynamic programming solution that solves SegLCS in $O(fn_1n_2)$ time with $O(fn_1 + n_2)$ space, where $n_1 = |T_1|$, $n_2 = |T_2|$, and $n_1 \le n_2$. Recently, Banerjee et al. [ESA 2024] presented an algorithm which, for a constant $f \geq 3$, solves SegLCS in $\tilde{O}((n_1n_2)^{1-(1/3)^{f-2}})$ time. In this paper, we deal with SegLCS as well as the problem of segmental subsequence pattern matching, SegE, which asks to determine whether a pattern $P$ of length $m$ has an $f$-segmentation that can be embedded into a text $T$ of length $n$. When $f = 1$, this is equivalent to substring matching, and when $f = |P|$, this is equivalent to subsequence matching. Our focus in this article is the case of general values of $f$, and our main contributions are threefold: (1) an $O((mn)^{1-\epsilon})$-time conditional lower bound for SegE under the strong exponential-time hypothesis (SETH), for any constant $\epsilon > 0$; (2) an $O(mn)$-time algorithm for SegE; (3) an $O(fn_2(n_1 - \ell+1))$-time algorithm for SegLCS, where $\ell$ is the solution length.
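To make the SegE problem concrete, the following is a minimal Python sketch of a quadratic-time dynamic program that computes the smallest number of segments with which $P$ can be embedded into $T$, and then compares that number with $f$. It is an illustration only and is not claimed to be the algorithm of the paper. The function names `min_segments` and `seg_embeddable` are hypothetical. The sketch assumes that an embedding places the segments $u_1, \ldots, u_f$ as non-overlapping substrings of $T$ in left-to-right order, and that segments may be adjacent in $T$, so that for $1 \le f \le m$ having an $f$-segmentation is the same as needing at most $f$ segments.

```python
import math


def min_segments(pattern: str, text: str) -> float:
    """Smallest number of segments with which `pattern` embeds into `text`.

    Returns math.inf if `pattern` is not a subsequence of `text`.
    Runs in O(m * n) time and O(n) space (two rolling rows).
    """
    n = len(text)
    inf = math.inf
    # best[j]: fewest segments to embed the current pattern prefix into text[:j].
    # ends[j]: same, but the last pattern character must be matched to text[j-1].
    best = [0] * (n + 1)   # the empty prefix needs 0 segments
    ends = [inf] * (n + 1)
    for i in range(1, len(pattern) + 1):
        new_best = [inf] * (n + 1)
        new_ends = [inf] * (n + 1)
        for j in range(1, n + 1):
            if pattern[i - 1] == text[j - 1]:
                # Either extend the segment ending at text[j-2],
                # or open a new segment at text[j-1] (cost +1).
                new_ends[j] = min(ends[j - 1], best[j - 1] + 1)
            # Either skip text[j-1] or end the prefix exactly here.
            new_best[j] = min(new_best[j - 1], new_ends[j])
        best, ends = new_best, new_ends
    return best[n]


def seg_embeddable(pattern: str, text: str, f: int) -> bool:
    """Assumption: 'has an f-segmentation' is read as 'needs at most f segments'."""
    return min_segments(pattern, text) <= f
```

For instance, under these assumptions `seg_embeddable("abcd", "abxcd", 2)` holds (segments `ab` and `cd`), while the same call with `f = 1` does not, since `abcd` is not a substring of `abxcd`. Opening a segment costs one unit and extending it costs nothing, which is the same cost structure as an affine-gap alignment.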
