Faster Algorithms for Longest Common Substring

Panagiotis Charalampopoulos; Tomasz Kociumaka; Jakub Radoszewski; Solon P. Pissis

TL;DR

This work advances longest common substring (LCS) and $k$-mismatch LCS algorithms in the word RAM model. For two strings of length at most $n$ over an alphabet of size $\sigma$, it computes an LCS in $O(n\log\sigma/\sqrt{\log n})$ time, which is sublinear in $n$ for small alphabets, using $O(n\log\sigma/\log n)$ space. For constant $k$, it computes a $k$-mismatch LCS in $O(n\log^{k-1/2} n)$ time. The core tool is a Two String Families LCP framework built on compacted tries, wavelet trees, and string synchronising sets: LCS is reduced to structured LCP problems via anchors and difference covers, and special cases are then solved efficiently with wavelet-tree-based data structures. The paper further extends these ideas to multiple input strings (up to $\lambda=O(\sqrt{\log n}/\log\log n)$) and develops a general $k$-LCS framework, underpinned by bicomplete/complete families and advanced data-structural techniques, that breaks the longstanding $n\log^k n$ barrier. Overall, the results show that LCS can be computed in sublinear time on packed string representations, and a recent conditional lower bound [Kempa and Kociumaka, STOC 2025] indicates that this running time is optimal.

Abstract

In the classic longest common substring (LCS) problem, we are given two strings $S$ and $T$, each of length at most $n$, over an alphabet of size $\sigma$, and we are asked to find a longest string occurring as a fragment of both $S$ and $T$. Weiner, in his seminal paper that introduced the suffix tree, presented an $O(n \log \sigma)$-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an $O(n)$-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in $O(n \log \sigma/\log n)$ space and read in $O(n \log \sigma/\log n)$ time. We show that, in this model, we can compute an LCS in time $O(n \log \sigma/\sqrt{\log n})$, which is sublinear in $n$ if $\sigma=2^{o(\sqrt{\log n})}$ (in particular, if $\sigma=O(1)$), using optimal space $O(n \log \sigma/\log n)$. In fact, it was recently shown that this result is conditionally optimal [Kempa and Kociumaka, STOC 2025]. We then lift our ideas to the problem of computing a $k$-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of $S$ that occurs in $T$ with at most $k$ mismatches. Thankachan et al. showed how to compute a $k$-mismatch LCS in $O(n \log^k n)$ time for $k=O(1)$ [J. Comput. Biol. 2016]. We show an $O(n \log^{k-1/2} n)$-time algorithm, for any constant $k>0$ and irrespective of the alphabet size, using $O(n)$ space as the previous approaches. We thus notably break through the well-known $n \log^k n$ barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with $k$ errors.
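To make the two problem statements concrete, the following is a minimal Python sketch of quadratic-time reference solutions. It is not the paper's algorithm: it uses textbook dynamic programming for LCS and a sliding window over diagonals for the $k$-mismatch variant, and it only illustrates what the faster algorithms are expected to output. The function names `longest_common_substring` and `k_mismatch_lcs` are hypothetical, and `k` is assumed to be a non-negative integer.

```python
def longest_common_substring(s, t):
    """Return one longest string occurring as a fragment of both s and t.

    Textbook O(|s| * |t|) dynamic programming (NOT the paper's method):
    cur[j] is the length of the longest common suffix of s[:i] and t[:j].
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return s[best_end - best_len:best_end]


def k_mismatch_lcs(s, t, k):
    """Return (length, start_in_s, start_in_t) of a longest fragment of s
    that matches an equal-length fragment of t with at most k mismatches.

    Baseline O(|s| * |t|) sketch: every alignment of s against t is a
    diagonal, and on each diagonal a sliding window keeps at most k
    mismatching positions. Assumes k >= 0.
    """
    best = (0, 0, 0)
    for d in range(-(len(t) - 1), len(s)):
        i0, j0 = max(d, 0), max(-d, 0)       # start of this diagonal
        m = min(len(s) - i0, len(t) - j0)    # number of aligned positions
        left = 0
        mismatches = 0
        for right in range(m):
            if s[i0 + right] != t[j0 + right]:
                mismatches += 1
            # Shrink the window from the left until it is valid again.
            while mismatches > k:
                if s[i0 + left] != t[j0 + left]:
                    mismatches -= 1
                left += 1
            length = right - left + 1
            if length > best[0]:
                best = (length, i0 + left, j0 + left)
    return best
```

With `k = 0` the second function reduces to the plain LCS problem, so the two can be cross-checked against each other on small inputs. Both run in quadratic time, far from the bounds discussed in the abstract, and are meant only as a specification to test faster implementations against.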
