Upper bounds on the average edit distance between two random strings

Matthieu Rosenfeld

TL;DR

This work analyzes the average similarity between two random strings under two measures, the edit distance $d_e$ and the length of the longest common subsequence (LCS), summarized by the constants $\alpha_k$ and $\gamma_k$ for an alphabet of size $k$. It adapts Lueker's technique to improve the known upper bounds on $\alpha_k$ for small alphabets, and provides a new implementation of that technique that improves the lower bounds on $\gamma_k$ for all small alphabets of size other than $2$ and $4$, using a fixed-point framework with vector recurrences $V_n(s,t)$ and $W_n(s,t)$. The central tool is a monotone transformation $T$ that, together with a witness vector, yields bounds computable in finite precision; the approach is realized in three code variants tailored to binary, general, and large alphabets. The binary case provides benchmarks for $\alpha_2$ and $\gamma_2$, and the results are compared with prior work, including recent improvements. The computational techniques may also be relevant to DNA reconstruction and related sequence-analysis problems.

Abstract

We study the average edit distance between two random strings. More precisely, we adapt a technique introduced by Lueker in the context of the average longest common subsequence of two random strings to improve the known upper bound on the average edit distance. We improve all the known upper bounds for small alphabets. We also provide a new implementation of Lueker's technique to improve the lower bound on the average length of the longest common subsequence of two random strings for all small alphabets of size other than $2$ and $4$.
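To make the two quantities concrete, the following is a minimal Monte Carlo sketch, not the paper's method. It assumes that the edit distance is the Levenshtein distance (unit-cost insertions, deletions, and substitutions) and that both strings are drawn uniformly at random with the same length $n$. The function names `edit_distance`, `lcs_length`, and `estimate_ratios` are hypothetical. The output is a finite-$n$ sample average, so it only illustrates the per-character ratios and gives no rigorous bound on $\alpha_k$ or $\gamma_k$.

```python
import random


def edit_distance(u, v):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        cur = [i]
        for j, b in enumerate(v, start=1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # match or substitute
        prev = cur
    return prev[-1]


def lcs_length(u, v):
    """Length of a longest common subsequence of u and v."""
    prev = [0] * (len(v) + 1)
    for a in u:
        cur = [0]
        for j, b in enumerate(v, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def estimate_ratios(k, n, trials, seed=0):
    """Sample averages of d_e(u, v)/n and LCS(u, v)/n over a k-letter alphabet."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    total_ed = 0
    total_lcs = 0
    for _ in range(trials):
        u = [rng.randrange(k) for _ in range(n)]
        v = [rng.randrange(k) for _ in range(n)]
        total_ed += edit_distance(u, v)
        total_lcs += lcs_length(u, v)
    return total_ed / (trials * n), total_lcs / (trials * n)


if __name__ == "__main__":
    for k in (2, 4):
        a_hat, g_hat = estimate_ratios(k, n=200, trials=20)
        print(f"k={k}: edit distance / n ~ {a_hat:.3f}, LCS / n ~ {g_hat:.3f}")
```

For equal-length strings every sample satisfies $d_e(u,v) \ge n - \mathrm{LCS}(u,v)$, so the two estimates returned by `estimate_ratios` always sum to at least $1$, which is a quick sanity check on the output.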

Paper Structure

This paper contains 10 sections, 7 theorems, 41 equations, and 7 tables.

Key Result

Theorem 1

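Let $f: \mathcal{X}_1\times \mathcal{X}_2\times\cdots\times \mathcal{X}_m\rightarrow\mathbb{R}$ be a function with the property that changing any one argument of $f$ while holding the others fixed changes the value of $f$ by at most $c$. Consider independent random variables $X_1,\ldots, X_m$ where $X_i\in\mathcal{X}_i$ for each $i$. Then, for any $t>0$, $\Pr\left[f(X_1,\ldots,X_m)-\mathbb{E}[f(X_1,\ldots,X_m)]\ge t\right]\le \exp\left(-\frac{2t^2}{mc^2}\right)$.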
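As an illustration of how a bounded-difference inequality of this kind applies to random strings, the sketch below assumes the standard upper-tail form of McDiarmid's inequality, $\Pr[f - \mathbb{E}f \ge t] \le \exp(-2t^2/(mc^2))$. The Levenshtein distance between two length-$n$ strings is a function of $m = 2n$ independent characters, and changing one character changes it by at most $c = 1$, so the bound becomes $\exp(-t^2/n)$. The helper names are hypothetical, and the empirical tail is measured around the sample mean rather than the true expectation, so this is a sanity check rather than a verification.

```python
import math
import random


def edit_distance(u, v):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        cur = [i]
        for j, b in enumerate(v, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def mcdiarmid_upper_tail(t, m, c):
    """Standard bound exp(-2 t^2 / (m c^2)) on P[f - E f >= t]."""
    return math.exp(-2.0 * t * t / (m * c * c))


def empirical_upper_tail(k, n, t, trials, seed=0):
    """Fraction of samples whose edit distance exceeds the sample mean by >= t."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    samples = []
    for _ in range(trials):
        u = [rng.randrange(k) for _ in range(n)]
        v = [rng.randrange(k) for _ in range(n)]
        samples.append(edit_distance(u, v))
    mean = sum(samples) / trials
    return sum(1 for d in samples if d - mean >= t) / trials


if __name__ == "__main__":
    n, t = 100, 10
    bound = mcdiarmid_upper_tail(t, m=2 * n, c=1.0)  # equals exp(-t^2 / n)
    observed = empirical_upper_tail(k=2, n=n, t=t, trials=100)
    print(f"bound = {bound:.4f}, observed tail frequency = {observed:.4f}")
```

The bound is typically far from tight for moderate $n$; its value is that it holds for every $n$ with explicit constants.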

Theorems & Definitions (11)

  • Theorem 1: McDiarmid's inequality
  • Lemma 2
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • proof
  • Lemma 7
  • proof
  • Lemma 8
  • ...and 1 more