Expected Length of the Longest Common Subsequence of Multiple Strings

Ray Li, William Ren, Yiran Wen

TL;DR

This work analyzes the generalized Chvátal-Sankoff constant $\gamma_{k,d}$, the normalized expected length of the longest common subsequence for $d$ random strings over an alphabet of size $k$. It proves a tight asymptotic for the binary case: $\gamma_{2,d} = \tfrac{1}{2} + \Theta(\tfrac{1}{\sqrt{d}})$, using a greedy diagonal-LCS approach to obtain a strong lower bound and a Guruswami–Wang counting argument for the upper bound. For larger alphabets, it establishes near-optimal bounds when $d \ge \Omega(\log k)$, namely $\frac{1}{k}\bigl(1 + \frac{c_1}{\sqrt{d}}\bigr) \le \gamma_{k,d} \le \frac{1}{k}\bigl(1 + c_2\sqrt{\frac{\log k}{d}}\bigr)$, with reductions showing $\gamma_{k,d} \ge \frac{2}{k}\gamma_{2,d}$. The results connect LCS with list-decoding against deletions and provide rigorous asymptotics that extend the understanding of LCS beyond the classical two-string, binary setting, with implications for related coding-theoretic problems.

Abstract

We study the generalized Chvátal-Sankoff constant $\gamma_{k,d}$, which represents the normalized expected length of the longest common subsequence (LCS) of $d$ independent uniformly random strings over an alphabet of size $k$. We derive asymptotically tight bounds for $\gamma_{2,d}$, establishing that $\gamma_{2,d} = \frac{1}{2} + \Theta\left(\frac{1}{\sqrt{d}}\right)$. We also derive asymptotically near-optimal bounds on $\gamma_{k,d}$ for $d \ge \Omega(\log k)$.
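
For reference, the quantity described above can be written out as a limit. The string length $n$ and the names $X_1, \dots, X_d$ are notation assumed here for illustration and do not appear in the abstract:

$$\gamma_{k,d} \;=\; \lim_{n\to\infty} \frac{\mathbb{E}\bigl[\mathrm{LCS}(X_1,\dots,X_d)\bigr]}{n}, \qquad X_1,\dots,X_d \text{ independent and uniform on } \{1,\dots,k\}^n .$$

The limit exists because the expected LCS length is superadditive in $n$ (concatenating common subsequences of two independent blocks gives a common subsequence of the concatenated strings), so Fekete's lemma applies.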

Paper Structure

This paper contains 13 sections, 11 theorems, 45 equations, 1 figure, 1 table.

Key Result

Theorem 1.1

There exist constants $0<c_1<c_2$ such that, for all integers $d\ge 2$, we have $\frac{1}{2} + \frac{c_1}{\sqrt{d}} \le \gamma_{2,d} \le \frac{1}{2} + \frac{c_2}{\sqrt{d}}$.

Figures (1)

  • Figure 1: Our matching strategy for $d=7$ random binary strings. Because all bits are independent, we can reveal the randomness in any order. We generate 7 random bits, one per string. Suppose, as illustrated, that 4 of these bits are $1$ and $Y=3$ are $0$. We reveal more bits in the strings with $0$s until each of them shows a $1$. Here, to obtain one LCS bit, we revealed $Z=13$ bits in total across the 7 strings.
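
The round described in the Figure 1 caption can be sketched in code. This is a minimal illustration, not the paper's implementation: it assumes the target bit in each round is the majority bit among the current heads (ties go to 1), and the names `greedy_common_subsequence` and `is_subsequence` are hypothetical helpers introduced here.

```python
import random


def greedy_common_subsequence(strings):
    """Greedily build a common subsequence of binary strings (lists of 0/1).

    Each round inspects the next unread bit of every string, picks the
    majority bit as the target (an assumption; ties go to 1), and advances
    every string just past its next occurrence of that target.
    """
    d = len(strings)
    pos = [0] * d  # index of the next unread bit in each string
    out = []
    while all(p < len(s) for p, s in zip(pos, strings)):
        heads = [s[p] for p, s in zip(pos, strings)]
        target = 1 if 2 * sum(heads) >= d else 0
        new_pos = []
        for p, s in zip(pos, strings):
            q = p
            # Reveal further bits of this string until the target appears.
            while q < len(s) and s[q] != target:
                q += 1
            if q == len(s):
                # Some string ran out before matching; stop here.
                return out
            new_pos.append(q + 1)
        pos = new_pos
        out.append(target)
    return out


def is_subsequence(sub, s):
    """Return True if `sub` is a subsequence of `s`."""
    it = iter(s)
    return all(x in it for x in sub)
```

Calling `greedy_common_subsequence` on $d$ lists of `random.randint(0, 1)` values of length $n$ and dividing the output length by $n$ gives an empirical lower estimate of the LCS rate for that sample. Heuristically, each minority string reveals about two extra bits on average, so a round consumes about $d + 2Y$ bits in expectation, and $Y$ falls below $d/2$ by an amount of order $\sqrt{d}$.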

Theorems & Definitions (16)

  • Theorem 1.1
  • Theorem 1.2
  • Lemma 3.1
  • proof
  • Lemma 3.2: Hoeffding
  • Lemma 3.3: see, for example, Proposition 3.3.1 of guruswami2019essential
  • Lemma 3.4: see, for example, Proposition 3.3.5 of guruswami2019essential
  • Lemma 3.5: kiwi2008
  • proof: Proof of Theorem \ref{thm:main}, lower bound
  • Lemma 4.1: Lemma 2.3 of GuruswamiW14
  • ...and 6 more