Faster Algorithms for Longest Common Substring
Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, Solon P. Pissis
TL;DR
This work gives faster algorithms for the longest common substring (LCS) problem and its $k$-mismatch variant in the word RAM model. For two strings of length at most $n$ over an alphabet of size $\sigma$, it computes an LCS in $O(n\log\sigma/\sqrt{\log n})$ time, which is sublinear in $n$ for small alphabets, using $O(n\log\sigma/\log n)$ space. For any constant $k>0$, it computes a $k$-mismatch LCS in $O(n\log^{k-1/2} n)$ time, irrespective of the alphabet size, breaking the longstanding $n\log^k n$ barrier. The key methodological step is a reduction from LCS to structured LCP problems via anchors and difference covers; the resulting Two String Families LCP framework builds on compacted tries, wavelet trees, and string synchronising sets, and its special cases are solved efficiently with wavelet-tree-based data structures. The paper further extends these ideas to multiple input strings (up to $\lambda=O(\sqrt{\log n}/\log\log n)$) and develops a general $k$-LCS framework underpinned by bicomplete/complete families and further data-structural techniques. The sublinear-time LCS bound was recently shown to be conditionally optimal [Kempa and Kociumaka, STOC 2025].
Abstract
In the classic longest common substring (LCS) problem, we are given two strings $S$ and $T$, each of length at most $n$, over an alphabet of size $\sigma$, and we are asked to find a longest string occurring as a fragment of both $S$ and $T$. Weiner, in his seminal paper that introduced the suffix tree, presented an $O(n \log \sigma)$-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an $O(n)$-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in $O(n \log \sigma/\log n)$ space and read in $O(n \log \sigma/\log n)$ time. We show that, in this model, we can compute an LCS in time $O(n \log \sigma/ \sqrt{\log n})$, which is sublinear in $n$ if $\sigma=2^{o(\sqrt{\log n})}$ (in particular, if $\sigma=O(1)$), using optimal space $O(n \log \sigma/\log n)$. In fact, it was recently shown that this result is conditionally optimal [Kempa and Kociumaka, STOC 2025]. We then lift our ideas to the problem of computing a $k$-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of $S$ that occurs in $T$ with at most $k$ mismatches. Thankachan et al. showed how to compute a $k$-mismatch LCS in $O(n \log^k n)$ time for $k=O(1)$ [J. Comput. Biol. 2016]. We show an $O(n \log^{k-1/2} n)$-time algorithm, for any constant $k>0$ and irrespective of the alphabet size, using $O(n)$ space, like the previous approaches. We thus notably break through the well-known $n \log^k n$ barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with $k$ errors.
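To make the two problem statements concrete, the following is a minimal Python sketch of a naive quadratic-time baseline. It is not the algorithm of this paper, nor any of the cited ones. It slides a window along every diagonal (every fixed offset between $S$ and $T$), keeping at most $k$ mismatches inside the window; with $k=0$ it solves the plain LCS problem. The function name `k_mismatch_lcs` and its return format are assumptions made for illustration only.

```python
def k_mismatch_lcs(s: str, t: str, k: int = 0) -> tuple[int, int, int]:
    """Naive baseline, O(len(s) * len(t)) time.

    Returns (length, start_in_s, start_in_t) for a longest pair of
    equal-length fragments of s and t that differ in at most k positions.
    With k = 0 this is the classic longest common substring.
    """
    n, m = len(s), len(t)
    best = (0, 0, 0)
    # Each diagonal d fixes the offset between aligned positions of s and t.
    for d in range(-(m - 1), n):
        i0, j0 = (d, 0) if d >= 0 else (0, -d)
        length = min(n - i0, m - j0)
        left = 0          # window start along the diagonal
        mismatches = 0    # mismatches inside the window [left, right]
        for right in range(length):
            if s[i0 + right] != t[j0 + right]:
                mismatches += 1
            # Shrink the window from the left until it has at most k mismatches.
            while mismatches > k:
                if s[i0 + left] != t[j0 + left]:
                    mismatches -= 1
                left += 1
            if right - left + 1 > best[0]:
                best = (right - left + 1, i0 + left, j0 + left)
    return best


if __name__ == "__main__":
    length, i, j = k_mismatch_lcs("abcde", "xbcdy", k=0)
    print(length, "abcde"[i:i + length])
    length, i, j = k_mismatch_lcs("abcde", "xbcdy", k=1)
    print(length, "abcde"[i:i + length], "xbcdy"[j:j + length])
```

For `"abcde"` and `"xbcdy"`, the exact LCS is `"bcd"` (length 3), while allowing one mismatch extends the answer to length 4. A baseline like this is mainly useful as a correctness oracle when testing faster implementations on small inputs.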
