
Subsequence Matching and LCS with Segment Number Constraints

Yuki Yonemoto, Takuya Mieno, Shunsuke Inenaga, Ryo Yoshinaka, Ayumi Shinohara

TL;DR

This work investigates two segment-constrained subsequence problems, SegE and SegLCS, built on the formalism of $f$-segmentations and embeddings. It delivers a near-complete complexity picture: a conditional lower bound under SETH that rules out $O((mn)^{1-\epsilon})$-time algorithms for SegE, an $O(mn)$-time algorithm for SegE, and an $O(f|T_2|(|T_1|-\ell+1))$-time algorithm for SegLCS that is fast when the optimal length $\ell$ is large. The SegE algorithms hinge on an affine-gap DP formulation and efficient two-pointer/KMP-based preprocessing for special cases, while the SegLCS algorithm fuses a DP by Banerjee et al. with the diagonal LCS approach of Nakatsu et al. to exploit long solutions. Together, these results map the computational landscape of segmental subsequences and guide future work on index structures, alphabet reductions, and related segmentation-generalized LCS problems.
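The affine-gap view of SegE can be illustrated with a small quadratic dynamic program. The sketch below is not taken from the paper. It assumes that an $f$-segmentation splits $P$ into $f$ consecutive, possibly empty pieces, and that an embedding places the pieces as non-overlapping substrings of $T$ in left-to-right order. Under that assumption it suffices to compute the minimum number of segments needed and compare it with $f$. The names `min_segments` and `seg_embeddable` are illustrative.

```python
from math import inf


def min_segments(pattern: str, text: str) -> float:
    """Minimum number of contiguous segments needed to embed `pattern`
    into `text` in order; returns inf if `pattern` is not a subsequence."""
    n = len(text)
    # prev_d[j]: fewest segments embedding the current pattern prefix in text[:j].
    # prev_e[j]: same, but the last pattern character is matched exactly at text[j-1].
    prev_d = [0] * (n + 1)      # the empty prefix needs zero segments
    prev_e = [inf] * (n + 1)
    for i in range(1, len(pattern) + 1):
        cur_d = [inf] * (n + 1)
        cur_e = [inf] * (n + 1)
        for j in range(1, n + 1):
            if pattern[i - 1] == text[j - 1]:
                # Either extend the segment ending at text[j-2] (no extra cost),
                # or open a new segment at text[j-1] (cost 1), as in affine gaps.
                cur_e[j] = min(prev_e[j - 1], prev_d[j - 1] + 1)
            # Either leave text[j-1] unused or end the prefix exactly there.
            cur_d[j] = min(cur_d[j - 1], cur_e[j])
        prev_d, prev_e = cur_d, cur_e
    return prev_d[n]


def seg_embeddable(pattern: str, text: str, f: int) -> bool:
    """True if `pattern` can be embedded into `text` using at most f segments."""
    return min_segments(pattern, text) <= f
```

For instance, `seg_embeddable("abcd", "abxcd", 2)` holds via the segments `ab` and `cd`, while the same call with `f = 1` fails because `abcd` is not a substring of `abxcd`. The two nested loops give $O(mn)$ time and the rolling rows give $O(n)$ space.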

Abstract

The longest common subsequence (LCS) is a fundamental problem in string processing which has numerous algorithmic studies, extensions, and applications. A sequence $u_1, \ldots, u_f$ of $f$ strings is said to be an ($f$-)segmentation of a string $P$ if $P = u_1 \cdots u_f$. Li et al. [BIBM 2022] proposed a new variant of the LCS problem for given strings $T_1, T_2$ and an integer $f$, which we hereby call the segmental LCS problem (SegLCS), of finding (the length of) a longest string $P$ that has an $f$-segmentation which can be embedded into both $T_1$ and $T_2$. Li et al. [IJTCS-FAW 2024] gave a dynamic programming solution that solves SegLCS in $O(fn_1n_2)$ time with $O(fn_1 + n_2)$ space, where $n_1 = |T_1|$, $n_2 = |T_2|$, and $n_1 \le n_2$. Recently, Banerjee et al. [ESA 2024] presented an algorithm which, for a constant $f \geq 3$, solves SegLCS in $\tilde{O}((n_1n_2)^{1-(1/3)^{f-2}})$ time. In this paper, we deal with SegLCS as well as the problem of segmental subsequence pattern matching, SegE, that asks to determine whether a pattern $P$ of length $m$ has an $f$-segmentation that can be embedded into a text $T$ of length $n$. When $f = 1$, this is equivalent to substring matching, and when $f = |P|$, this is equivalent to subsequence matching. Our focus in this article is the case of general values of $f$, and our main contributions are threefold: (1) an $O((mn)^{1-\epsilon})$-time conditional lower bound for SegE under the strong exponential-time hypothesis (SETH), for any constant $\epsilon > 0$; (2) an $O(mn)$-time algorithm for SegE; (3) an $O(fn_2(n_1 - \ell+1))$-time algorithm for SegLCS, where $\ell$ is the solution length.
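To make the $O(fn_1n_2)$ baseline concrete, the following is a minimal sketch of one dynamic program in that time regime. It is not claimed to be the recurrence of Li et al., and it uses $O(n_1n_2)$ space rather than their $O(fn_1 + n_2)$. It assumes that segments may be empty, so "at most $f$ segments" is what gets computed, and the name `seg_lcs_length` is illustrative. Each round adds one more permitted segment, and a segment ending at a pair of positions is taken to be the whole common suffix there (shortening it by one character can gain at most one character elsewhere).

```python
def seg_lcs_length(t1: str, t2: str, f: int) -> int:
    """Length of a longest string that splits into at most f segments
    embeddable, in order and without overlap, into both t1 and t2."""
    n1, n2 = len(t1), len(t2)
    # suf[i][j]: length of the longest common suffix of t1[:i] and t2[:j].
    suf = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            if t1[i - 1] == t2[j - 1]:
                suf[i][j] = suf[i - 1][j - 1] + 1
    # prev[i][j]: best length for t1[:i], t2[:j] with the segments allowed so far.
    prev = [[0] * (n2 + 1) for _ in range(n1 + 1)]  # zero segments allowed
    for _ in range(f):
        cur = [[0] * (n2 + 1) for _ in range(n1 + 1)]
        for i in range(1, n1 + 1):
            for j in range(1, n2 + 1):
                k = suf[i][j]
                cur[i][j] = max(
                    cur[i - 1][j],            # skip t1[i-1]
                    cur[i][j - 1],            # skip t2[j-1]
                    prev[i - k][j - k] + k,   # last segment = common suffix
                )
        prev = cur
    return prev[n1][n2]
```

With `t1 = "abcde"` and `t2 = "abxde"`, one segment yields the longest common substring length 2, and two segments yield 4 via `ab` and `de`.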

Paper Structure

This paper contains 9 sections, 11 theorems, 23 equations, 3 tables, 2 algorithms.

Key Result

Theorem 1

For any $\epsilon > 0$ and any $\alpha \le 1$, Episode Matching on binary strings $T$ and $P$ with $|P| \in \Theta(|T|^\alpha)$ cannot be solved in $O((|T||P|)^{1-\epsilon})$ time, unless SETH is false.
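For context, the bound in Theorem 1 is matched up to subpolynomial factors by a simple quadratic dynamic program. The sketch below assumes the usual formulation of Episode Matching, namely finding the length of a shortest substring of $T$ that contains $P$ as a subsequence. The function name `shortest_episode` is illustrative and the code is not taken from the cited work.

```python
from typing import Optional


def shortest_episode(text: str, pattern: str) -> Optional[int]:
    """Length of a shortest substring of `text` containing `pattern`
    as a subsequence, or None if there is no such substring."""
    m = len(pattern)
    if m == 0:
        return 0
    # start[i]: latest index s such that pattern[:i] is a subsequence of
    # text[s:j+1] for the current j, or -1 if no such s exists.
    start = [-1] * (m + 1)
    best: Optional[int] = None
    for j, c in enumerate(text):
        start[0] = j  # an empty prefix may begin right at position j
        # Go downwards so start[i-1] still refers to the window ending before j.
        for i in range(m, 0, -1):
            if pattern[i - 1] == c and start[i - 1] != -1:
                start[i] = start[i - 1]
        if start[m] != -1:
            length = j + 1 - start[m]
            best = length if best is None else min(best, length)
    return best
```

The loop nest runs in $O(|T||P|)$ time, which is the running time that Theorem 1 says cannot be improved by a polynomial factor unless SETH fails.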

Theorems & Definitions (18)

  • Conjecture 1: The Strong Exponential-Time Hypothesis; SETH
  • Theorem 1 (Bille et al., 2022)
  • Corollary 1 (Bille et al., 2022)
  • Theorem 2
  • proof
  • Example 1
  • Theorem 3
  • proof
  • Theorem 4
  • Lemma 1
  • ...and 8 more