Upper bounds on the average edit distance between two random strings

Matthieu Rosenfeld

TL;DR

This work analyzes the average similarity between two random strings under two measures: the edit distance $d_e$ and the length of the longest common subsequence (LCS), summarized by the constants $\alpha_k$ and $\gamma_k$ for an alphabet of size $k$. It adapts Lueker's technique, originally developed for the LCS, to improve the known upper bounds on $\alpha_k$ for small alphabets, and it provides a new implementation of the same technique that improves the lower bounds on $\gamma_k$ for all small alphabets of size other than $2$ and $4$. Both parts rely on a fixed-point framework with vector recurrences $V_n(s,t)$ and $W_n(s,t)$. The central tool is a monotone transformation $T$ that, together with a witness vector, yields bounds that can be verified in finite precision; the approach is realized in three code variants tailored to the binary case, the general case, and large alphabets. The results are compared with prior work, including recent improvements, with the binary constants $\alpha_2$ and $\gamma_2$ serving as reference points. The computational techniques may be relevant to DNA reconstruction and related sequence-analysis problems.

Abstract

We study the average edit distance between two random strings. More precisely, we adapt a technique introduced by Lueker in the context of the average longest common subsequence of two random strings to improve the known upper bound on the average edit distance. We improve all the known upper bounds for small alphabets. We also provide a new implementation of Lueker's technique to improve the lower bound on the average length of the longest common subsequence of two random strings for all small alphabets of size other than $2$ and $4$.
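To make the two quantities concrete, here is a minimal sketch that computes the edit distance and the LCS length by dynamic programming and averages them over random string pairs. It assumes the edit distance is the Levenshtein distance (insertions, deletions, and substitutions each cost 1) and that strings are drawn uniformly at random; the function names `edit_distance`, `lcs_length`, and `estimate_ratios` are illustrative and do not come from the paper. The resulting ratios are finite-length Monte Carlo estimates, not the rigorous bounds obtained with Lueker's technique.

```python
import random


def edit_distance(u, v):
    """Levenshtein distance between sequences u and v (assumed unit costs)."""
    prev = list(range(len(v) + 1))  # distance from the empty prefix of u
    for i, a in enumerate(u, start=1):
        cur = [i]  # distance from u[:i] to the empty prefix of v
        for j, b in enumerate(v, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a
                cur[j - 1] + 1,          # insert b
                prev[j - 1] + (a != b),  # match (cost 0) or substitute (cost 1)
            ))
        prev = cur
    return prev[-1]


def lcs_length(u, v):
    """Length of a longest common subsequence of sequences u and v."""
    prev = [0] * (len(v) + 1)
    for a in u:
        cur = [0]
        for j, b in enumerate(v, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def estimate_ratios(k, n, trials, seed=0):
    """Average d_e/n and LCS/n over `trials` uniform random pairs of length n
    on an alphabet of size k (hypothetical helper, for illustration only)."""
    rng = random.Random(seed)
    total_ed = 0
    total_lcs = 0
    for _ in range(trials):
        u = [rng.randrange(k) for _ in range(n)]
        v = [rng.randrange(k) for _ in range(n)]
        total_ed += edit_distance(u, v)
        total_lcs += lcs_length(u, v)
    return total_ed / (trials * n), total_lcs / (trials * n)


if __name__ == "__main__":
    ed_ratio, lcs_ratio = estimate_ratios(k=2, n=200, trials=20)
    print(f"estimated E[d_e]/n = {ed_ratio:.3f}, E[LCS]/n = {lcs_ratio:.3f}")
```

Increasing `n` and `trials` tightens the estimate but only gives empirical values; certified bounds on $\alpha_k$ and $\gamma_k$ require the kind of argument developed in the paper.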

Paper Structure (10 sections, 7 theorems, 41 equations, 7 tables)

Introduction
Average edit distance
From to
Bounds on
Implementation and bounds
The binary case.
The general case.
Large alphabets.
Longest common subsequence
Implementation and results

Key Result

Theorem 1

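Let $f: \mathcal{X}_1\times \mathcal{X}_2\times\cdots\times \mathcal{X}_m\rightarrow\mathbb{R}$ be a function with the property that changing any one argument of $f$ while holding the others fixed changes the value of $f$ by at most $c$. Consider independent random variables $X_1,\ldots, X_m$ where $X_i\in\mathcal{X}_i$ for all $i$. Then, for any $t>0$, $\Pr\left[\,|f(X_1,\ldots,X_m)-\mathbb{E}[f(X_1,\ldots,X_m)]|\ge t\,\right]\le 2\exp\left(-\frac{2t^2}{mc^2}\right)$.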
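As an illustration of how a bounded-differences inequality of this kind applies to random strings, the following worked instance uses the standard two-sided form of McDiarmid's inequality with a common constant $c$. The setup is an assumption made for illustration: $u$ and $v$ are independent uniform random strings of length $n$, their $2n$ letters play the role of the independent variables, and $f$ is the edit distance $d_e(u,v)$. Changing a single letter changes $d_e$ by at most $1$, so $m = 2n$ and $c = 1$.

```latex
% Standard two-sided McDiarmid bound with common constant c:
%   Pr[ |f - E f| >= t ] <= 2 exp( -2 t^2 / (m c^2) ).
% Assumed instance: f = d_e(u, v), m = 2n letters, c = 1.
\[
  \Pr\Bigl[\,\bigl|d_e(u,v) - \mathbb{E}[d_e(u,v)]\bigr| \ge t\,\Bigr]
  \;\le\; 2\exp\!\left(-\frac{2t^2}{2n \cdot 1^2}\right)
  \;=\; 2\exp\!\left(-\frac{t^2}{n}\right).
\]
% Taking t = \varepsilon n shows that d_e(u,v)/n deviates from its mean
% by at least \varepsilon with probability at most 2 exp(-\varepsilon^2 n).
```

The same computation applies to the LCS length, which also changes by at most $1$ when a single letter is modified.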

Theorems & Definitions (11)

Theorem 1: McDiarmid's inequality
Lemma 2
Lemma 3
proof
Lemma 4
proof
proof
Lemma 7
proof
Lemma 8
...and 1 more
