Improved Lower Bounds on the Expected Length of Longest Common Subsequences

George T. Heineman, Chase Miller, Daniel Reichman, Andrew Salls, Gábor Sárközy, Duncan Soiffer

TL;DR

This paper advances the understanding of the Chvátal-Sankoff constants $\gamma_{\sigma,d}$ by establishing an improved lower bound for the most studied case of two binary strings, $\gamma_{2,2} \geq 0.792665992$, and by improving lower bounds across a broad range of $(\sigma,d)$. It builds on the Kiwi-Soto lower-bound framework, adding parallelization, a novel binary encoding, and a memory-efficient external-memory implementation so that the computation scales to larger string lengths. The key contributions are a high-performance implementation, state-of-the-art lower bounds for multiple parameter combinations, and publicly available code for replication and further research. These improvements tighten the known estimates of $\gamma_{\sigma,d}$ and open paths toward tighter bounds and possible structural relations among the constants in the LCS literature.

Abstract

It has been proven that, when normalized by $n$, the expected length of a longest common subsequence of $d$ random strings of length $n$ over an alphabet of size $\sigma$ converges to some constant that depends only on $d$ and $\sigma$. These values are known as the Chvátal-Sankoff constants, and determining their exact values is a well-known open problem. Upper and lower bounds are known for some combinations of $\sigma$ and $d$, with the best lower and upper bounds for the most studied case, $\sigma=2, d=2$, at $0.788071$ and $0.826280$, respectively. Building on previous algorithms for lower-bounding the constants, we implement runtime optimizations, parallelization, and an efficient memory reading and writing scheme to obtain an improved lower bound of $0.792665992$ for $\sigma=2, d=2$. We additionally improve upon almost all previously reported lower bounds for the Chvátal-Sankoff constants when either the alphabet size, the number of strings, or both are larger than 2.
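To make the quantity in the abstract concrete, the following is a minimal Monte Carlo sketch for the case $d=2$: it draws pairs of uniformly random strings, computes their LCS length with the textbook dynamic program, and reports the average length divided by $n$. This is not the paper's method (which derives rigorous lower bounds via the Kiwi-Soto framework); the function names `lcs_length` and `estimate_gamma` and all parameter choices are illustrative assumptions. Because the expected LCS length is superadditive in $n$, the finite-$n$ average tends to sit below the limiting constant, and it carries sampling noise, so it is only a rough estimate.

```python
import random


def lcs_length(a, b):
    """Length of a longest common subsequence of sequences a and b.

    Textbook O(len(a) * len(b)) dynamic program, storing one row at a time.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def estimate_gamma(n, trials, sigma=2, seed=0):
    """Monte Carlo estimate of E[LCS length] / n for two random strings.

    Hypothetical helper for illustration only: handles d = 2 strings over
    the alphabet {0, ..., sigma - 1}, sampled uniformly and independently.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = [rng.randrange(sigma) for _ in range(n)]
        b = [rng.randrange(sigma) for _ in range(n)]
        total += lcs_length(a, b)
    return total / (trials * n)


if __name__ == "__main__":
    # Small parameters so the pure-Python loop finishes quickly.
    print(estimate_gamma(n=200, trials=20))
```

Increasing `n` and `trials` moves the estimate closer to the limiting constant, at quadratic cost in `n` per trial; a simulation of this kind gives an estimate only, not a proven bound.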

Paper Structure

This paper contains 17 sections, 1 theorem, 16 equations, 2 figures, 3 tables, and 6 algorithms.

Key Result

Lemma 2.1

Suppose a function $F: (\mathbb{R}^{\sigma^{d \ell}})^d \to \mathbb{R}^{\sigma^{d \ell}}$ satisfies the following three properties: [...] Then, for any sequence $(\mathbf{v_n})_{n\in \mathbb{N}}$ of vectors in $\mathbb{R}^{\sigma^{d \ell}}$ such that $\mathbf{v_n} \geq F(\mathbf{v_{n-1}}, \ldots, \mathbf{v_{n-d}})$ for all $n \geq d$, there exists a vector $\mathbf{u_0} \in \mathbb{R}^{\sigma^{d \ell}}$ [...]

Figures (2)

  • Figure 1: The best upper and lower bounds on $\gamma$ and estimates of $\gamma$ over time.
  • Figure 2: Runtime of the Feasible Triplet and Binary Feasible Triplet algorithms for $\gamma_{2,2}$ as string length parameter $\ell$ increases. Specialization to the binary case improves performance and the presented method of memory I/O reduces the overhead incurred by switching from RAM to disk memory to less than 18x.

Theorems & Definitions (3)

  • Definition 1.1
  • Definition 2.1
  • Lemma 2.1