Faster Algorithms for Longest Common Substring

Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, Solon P. Pissis

TL;DR

This work gives faster algorithms for the longest common substring (LCS) and $k$-mismatch LCS ($k$-LCS) problems in the word RAM model. It introduces a Two String Families LCP framework built on compacted tries, wavelet trees, and string synchronising sets. For two strings of total length $n$ over an alphabet of size $\sigma$, this yields an LCS algorithm running in $O(n\log\sigma/\sqrt{\log n})$ time, which is sublinear for small alphabets, using $O(n\log\sigma/\log n)$ space. For constant $k$, it yields a $k$-LCS algorithm running in $O(n\log^{k-1/2} n)$ time. The key methodological step is to reduce LCS to structured LCP problems via anchors and difference covers, and then to solve the resulting special cases efficiently with wavelet-tree based data structures. The paper further extends these ideas to multiple input strings (up to $\lambda=O(\sqrt{\log n}/\log\log n)$) and develops a general $k$-LCS framework that breaks the longstanding $n\log^k n$ barrier, underpinned by bicomplete/complete families and advanced data-structural techniques. Overall, the results show that LCS can be computed in sublinear time on packed representations of strings over small alphabets, and a recent conditional lower bound indicates that the LCS running time is optimal.

Abstract

In the classic longest common substring (LCS) problem, we are given two strings $S$ and $T$, each of length at most $n$, over an alphabet of size $\sigma$, and we are asked to find a longest string occurring as a fragment of both $S$ and $T$. Weiner, in his seminal paper that introduced the suffix tree, presented an $O(n \log \sigma)$-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an $O(n)$-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in $O(n \log \sigma/\log n)$ space and read in $O(n \log \sigma/\log n)$ time. We show that, in this model, we can compute an LCS in time $O(n \log \sigma/\sqrt{\log n})$, which is sublinear in $n$ if $\sigma=2^{o(\sqrt{\log n})}$ (in particular, if $\sigma=O(1)$), using optimal space $O(n \log \sigma/\log n)$. In fact, it was recently shown that this result is conditionally optimal [Kempa and Kociumaka, STOC 2025]. We then lift our ideas to the problem of computing a $k$-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of $S$ that occurs in $T$ with at most $k$ mismatches. Thankachan et al. showed how to compute a $k$-mismatch LCS in $O(n \log^k n)$ time for $k=O(1)$ [J. Comput. Biol. 2016]. We show an $O(n \log^{k-1/2} n)$-time algorithm, for any constant $k>0$ and irrespective of the alphabet size, using $O(n)$ space as the previous approaches. We thus notably break through the well-known $n \log^k n$ barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with $k$ errors.
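
To make the two problem statements concrete, the following is a minimal reference sketch of LCS and $k$-mismatch LCS. It is a quadratic-time baseline that slides a window along every diagonal of the alignment grid, not the algorithm of the paper; the function name `k_mismatch_lcs` and its return convention are assumptions made for illustration only.

```python
def k_mismatch_lcs(s: str, t: str, k: int = 0) -> tuple[int, int, int]:
    """Naive O(|s| * |t|) baseline (NOT the paper's algorithm).

    Returns (length, i, j) such that s[i:i+length] and t[j:j+length]
    differ in at most k positions and length is as large as possible.
    With k = 0 this is the classic longest common substring.
    Indices are 0-based.
    """
    best = (0, 0, 0)
    n, m = len(s), len(t)
    # Every pair of aligned fragments lies on one diagonal d = j - i.
    for d in range(-(n - 1), m):
        i = max(0, -d)            # first position of the diagonal in s
        j = i + d                 # matching first position in t
        length = min(n - i, m - j)
        left = 0                  # window start, as an offset on the diagonal
        mismatches = 0            # mismatches inside the window [left, right]
        for right in range(length):
            if s[i + right] != t[j + right]:
                mismatches += 1
            # Shrink the window from the left until it has at most k mismatches.
            while mismatches > k:
                if s[i + left] != t[j + left]:
                    mismatches -= 1
                left += 1
            if right - left + 1 > best[0]:
                best = (right - left + 1, i + left, j + left)
    return best
```

For example, `k_mismatch_lcs("banana", "ananas")` reports a common substring of length 5 (`"anana"`), and allowing one mismatch on `"abcde"` versus `"xbcdy"` extends the best match from length 3 to length 4.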

Paper Structure

This paper contains 40 sections, 44 theorems, 64 equations, 9 figures, 3 algorithms.

Key Result

Theorem 1.1

Given two strings $S$ and $T$ of total length $n$ over an alphabet $[0\mathinner{.\,.} \sigma)$, the LCS problem can be solved in $\mathcal{O}(n\log\sigma/\sqrt{\log n})$ time using $\mathcal{O}(n \log\sigma/ \log n)$ space.
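
As a short check of when this bound is sublinear, and how far it is from the time needed merely to read the packed input, one can compare the three quantities stated in the abstract (assuming $\sigma \ge 2$):

$$
\frac{n\log\sigma}{\sqrt{\log n}} = o(n)
\iff \log\sigma = o\bigl(\sqrt{\log n}\bigr)
\iff \sigma = 2^{o(\sqrt{\log n})},
\qquad
\frac{n\log\sigma/\sqrt{\log n}}{n\log\sigma/\log n} = \sqrt{\log n}.
$$

So the running time is sublinear exactly in the regime named in the abstract, and it exceeds the $O(n\log\sigma/\log n)$ read time by a factor of $\sqrt{\log n}$.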

Figures (9)

  • Figure 1: Left: The LCS of the two strings has length 5. Right: The 1-LCS (1-mismatch LCS) of the same two strings has length 7; the mismatching letters are shown in red.
  • Figure 2: An example of a $\tau$-synchronising set ($\tau=3$) of this string is $A=\{1,3,6,12,13,16\}$. Fragments $T[1 \mathinner{.\,.} 7)$ and $T[16 \mathinner{.\,.} 22)$ match and thus $1$, $16$ are both in $A$. Fragments $T[2 \mathinner{.\,.} 8)$ and $T[17 \mathinner{.\,.} 23)$ match and thus $2$, $17$ are both not in $A$. Among every 3 consecutive positions (that are sufficiently far from the end of the string) there is a synchronising position, except for the positions $7,\ldots,11$ which imply a long fragment with period $\frac{1}{3}\tau=1$ (a so-called $\tau$-run).
  • Figure 3: Left: partitioning of $S$ and $T$ into fragments of length (up to) $2m$. All distinct substrings in $S$ and in $T$ are $X_1,\ldots,X_4$ and $Y_1,\ldots,Y_5$, respectively. The LCS of length $m$ (in blue) is a substring of $X_1$ in $S$ and of $Y_2$ in $T$. Right: strings $X$ and $Y$; their LCS (shown in blue) is the same as the LCS of $S$ and $T$.
  • Figure 4: Strings from Figure 1 with their LCS $S[3 \mathinner{.\,.} 8)=T[4 \mathinner{.\,.} 9)$ of length $\ell=5$. The dots correspond to elements of a 5-cover $(D,h)$ where $D=\{1,2,4,\,6,7,9,\,11,12,14,\ldots\}$. For $i=3$, $j=4$, we have $i'=i+h(i,j)=6$, $j'=j+h(i,j)=7$, and $\mathsf{LCP}((S[1 \mathinner{.\,.} i'))^R,(T[1 \mathinner{.\,.} j'))^R)+\mathsf{LCP}(S[i' \mathinner{.\,.} |S|], T[j' \mathinner{.\,.} |T|])=\ell=5$.
  • Figure 5: The setting in Lemma \ref{lem:suffix} on list $\mathcal{R}$. Elements of $\mathcal{P}$ are shown in red and elements of $\mathcal{Q}$ in blue. For element $e=(U,V)$ from $\mathcal{Q}$, we have $\textsf{lex-pred}(e) = (Y_1,Y_2)$ and $\textsf{lex-succ}(e) = (Z_1,Z_2)$.
  • ...and 4 more figures
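
The anchoring idea in the caption of Figure 4 can be sketched in a few lines. The sketch below assumes that the 5-cover $D$ consists of the positions congruent to 1, 2 or 4 modulo 5 (which matches the listed elements) and that $h(i,j)$ is the smallest shift in $[0,5)$ moving both positions into $D$. The function names and the two example strings are hypothetical and are not the strings of the figure. Positions are 1-based, as in the caption.

```python
RESIDUES = frozenset({1, 2, 4})   # assumed residues of the 5-cover D
PERIOD = 5


def in_cover(p: int) -> bool:
    """True if the 1-based position p belongs to D."""
    return p % PERIOD in RESIDUES


def shift(i: int, j: int) -> int:
    """Smallest h in [0, PERIOD) with i + h and j + h both in D."""
    for h in range(PERIOD):
        if in_cover(i + h) and in_cover(j + h):
            return h
    raise ValueError("RESIDUES is not a difference cover modulo PERIOD")


def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def anchored_length(s: str, t: str, i: int, j: int) -> tuple[int, int, int]:
    """Return (i', j', backward LCP + forward LCP) for 1-based i, j."""
    h = shift(i, j)
    ip, jp = i + h, j + h
    # The 1-based half-open prefix S[1..i') is the Python slice s[:ip - 1].
    backward = lcp(s[:ip - 1][::-1], t[:jp - 1][::-1])
    forward = lcp(s[ip - 1:], t[jp - 1:])
    return ip, jp, backward + forward
```

With the caption's positions, `shift(3, 4)` evaluates to 3, giving the anchors $i'=6$ and $j'=7$. On the hypothetical strings `"xyHELLOab"` and `"pqrHELLOcd"`, whose common substring `"HELLO"` starts at positions 3 and 4, `anchored_length` returns `(6, 7, 5)`: the backward and forward LCPs at the anchors sum to the length of the match.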

Theorems & Definitions (92)

  • Theorem 1.1
  • Theorem 1.2
  • Lemma 1.3: DBLP:conf/cpm/Charalampopoulos18
  • Example 1
  • Lemma 1.3
  • Theorem 1.4
  • Lemma 2.2: Periodicity Lemma (weak version) FW:periodicity-lemma
  • Lemma 2.3
  • Proposition 2.3
  • Theorem 2.4: DBLP:conf/stoc/KempaK19
  • ...and 82 more