Improved Lower Bounds on the Expected Length of Longest Common Subsequences

George T. Heineman; Chase Miller; Daniel Reichman; Andrew Salls; Gábor Sárközy; Duncan Soiffer

TL;DR

This paper advances the understanding of the Chvátal-Sankoff constants $\gamma_{\sigma,d}$ by establishing an improved lower bound for the canonical case of two binary strings, $\gamma_{2,2} \geq 0.792665992$, and by improving lower bounds for a broad range of other $(\sigma,d)$. It builds on the Kiwi-Soto lower-bound framework, adding parallelization, a novel binary encoding, and a memory-efficient external-memory implementation that scales the computation to larger string lengths. The key contributions are a high-performance implementation, state-of-the-art lower bounds across multiple parameter settings, and publicly available code that enables replication and further research. These improvements sharpen the known estimates of $\gamma_{\sigma,d}$ and open paths toward tighter bounds and toward possible structural relations among the constants studied in the LCS literature.

Abstract

It has been proven that, when normalized by $n$, the expected length of a longest common subsequence of $d$ random strings of length $n$ over an alphabet of size $\sigma$ converges to a constant that depends only on $d$ and $\sigma$. These values are known as the Chvátal-Sankoff constants, and determining their exact values is a well-known open problem. Upper and lower bounds are known for some combinations of $\sigma$ and $d$; for the most studied case, $\sigma=2, d=2$, the best previously known lower and upper bounds are $0.788071$ and $0.826280$, respectively. Building on previous algorithms for lower-bounding the constants, we implement runtime optimizations, parallelization, and an efficient memory reading and writing scheme to obtain an improved lower bound of $0.792665992$ for $\sigma=2, d=2$. We additionally improve upon almost all previously reported lower bounds for the Chvátal-Sankoff constants when the alphabet size, the number of strings, or both are larger than 2.
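To make the quantity in the abstract concrete, the following minimal sketch estimates the normalized LCS length of two random strings by simulation. It is only an illustration of the definition, not the Kiwi-Soto procedure used in the paper, and a sample mean of this kind is a statistical estimate rather than a rigorous bound. The function names `lcs_length` and `estimate_normalized_lcs`, along with the parameters `trials` and `seed`, are assumptions introduced here for illustration.

```python
import random


def lcs_length(a, b):
    """Length of a longest common subsequence of sequences a and b.

    Classic dynamic program using O(len(b)) memory:
    prev[j] holds the LCS length of the processed prefix of a and b[:j].
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def estimate_normalized_lcs(n, sigma=2, trials=20, seed=0):
    """Monte Carlo estimate of E[LCS]/n for two uniform random strings
    of length n over an alphabet of size sigma (the d = 2 case only)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = [rng.randrange(sigma) for _ in range(n)]
        b = [rng.randrange(sigma) for _ in range(n)]
        total += lcs_length(a, b)
    return total / (trials * n)


if __name__ == "__main__":
    # Sample means for finite n; these are estimates, not proven bounds.
    for n in (50, 200, 800):
        print(n, estimate_normalized_lcs(n))
```

Because the expected LCS length is superadditive in $n$, the finite-$n$ ratio approaches the constant from below, so simulations at modest lengths tend to underestimate it. This is one reason rigorous bounds require a different approach from sampling.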

Paper Structure (17 sections, 1 theorem, 16 equations, 2 figures, 3 tables, 6 algorithms)

Introduction
Background and Related Work
The Kiwi-Soto Algorithm
The Binary Case
Implementation Details
Parallelization
Indexing
Array Reductions and Symmetries
Sequential Memory Access
L
L and its Recursion
L and its Recursion
Results
Conclusion
Program Code and Additional Material
...and 2 more sections

Key Result

Lemma 2.1

Suppose a function $F: (\mathbb{R}^{\sigma^{d \ell}})^d \to \mathbb{R}^{\sigma^{d \ell}}$ satisfies the following three properties: Then, for any sequence $(\mathbf{v_n})_{n\in \mathbb{N}}$ of vectors in $\mathbb{R}^{\sigma^{d \ell}}$ such that $\mathbf{v_n} \geq F(\mathbf{v_{n-1}}, \ldots, \mathbf{v_{n-d}})$ for all $n \geq d$, there exists a vector $\mathbf{u_0} \in \mathbb{R}^{\sigma

Figures (2)

Figure 1: The best upper and lower bounds on $\gamma$ and estimates of $\gamma$ over time.
Figure 2: Runtime of the Feasible Triplet and Binary Feasible Triplet algorithms for $\gamma_{2,2}$ as string length parameter $\ell$ increases. Specialization to the binary case improves performance and the presented method of memory I/O reduces the overhead incurred by switching from RAM to disk memory to less than 18x.

Theorems & Definitions (3)

Definition 1.1
Definition 2.1
Lemma 2.1
