Computable Bounds and Monte Carlo Estimates of the Expected Edit Distance

Gianfranco Bilardi; Michele Schimd

TL;DR

This paper investigates the expected edit distance between two random strings over an alphabet of size $k$ by studying $\alpha_k(n)=e_k(n)/n$ and its limit $\alpha_k$. It proves $\alpha_k(n)-Q(n) \le \alpha_k \le \alpha_k(n)$ with a computable, universal bound $Q(n)=\Theta(\sqrt{\log n / n})$, establishing computability of $\alpha_k$ and framing practical estimation approaches. It then develops two complementary strategies: (i) Monte Carlo estimates of $\alpha_k(n)$ with confidence intervals derived via McDiarmid's inequality, enabling accurate estimates for large $n$; (ii) analytical upper and lower bounds on $\alpha_k$ via a coalesced dynamic programming algorithm (CDP) for eccentricity, and a novel ball-size analysis that yields a computable lower bound $\beta_k^*$ with $\lim_{k\to\infty}\beta_k^*=1$. The combination yields improved numerical bounds for many $k$ and $n$, and the work outlines a conjecture on the asymptotic behavior $\lim_{k\to\infty} (1-\alpha_k)k=c_\alpha$ with $c_\alpha\ge 1$. Together, these results advance the computability and estimation of fundamental constants governing the distance between random strings, with implications for seeding biological sequence analysis and nearest-neighbor search applications.
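As a rough illustration of strategy (i), the sketch below estimates $\alpha_k(n)$ by sampling random string pairs and attaches a McDiarmid-style confidence half-width. It assumes unit-cost insertions, deletions, and substitutions. It also relies on the observation that changing any one of the $2nN$ sampled characters (for $N$ pairs) changes the averaged, normalized distance by at most $1/(nN)$, which gives a half-width of $\sqrt{\ln(2/\delta)/(nN)}$ at confidence $1-\delta$. The function names, the `delta` and `seed` parameters, and the exact constant are assumptions of this sketch; the paper's own interval may be stated differently.

```python
import math
import random


def edit_distance(x, y):
    """Levenshtein distance (unit-cost edits) via row-by-row dynamic programming."""
    prev = list(range(len(y) + 1))
    for i, xi in enumerate(x, start=1):
        curr = [i]
        for j, yj in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                 # delete xi
                            curr[j - 1] + 1,             # insert yj
                            prev[j - 1] + (xi != yj)))   # match or substitute
        prev = curr
    return prev[-1]


def monte_carlo_alpha(k, n, num_pairs, delta=1e-3, seed=0):
    """Estimate alpha_k(n) from num_pairs random pairs of length-n strings.

    Returns (estimate, half_width). Under the bounded-difference argument
    described in the lead-in (an assumption of this sketch), alpha_k(n) lies
    within half_width of the estimate with probability at least 1 - delta.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(num_pairs):
        x = [rng.randrange(k) for _ in range(n)]
        y = [rng.randrange(k) for _ in range(n)]
        total += edit_distance(x, y)
    estimate = total / (num_pairs * n)
    half_width = math.sqrt(math.log(2 / delta) / (n * num_pairs))
    return estimate, half_width


if __name__ == "__main__":
    est, hw = monte_carlo_alpha(k=4, n=200, num_pairs=50)
    print(f"alpha_4(200) is roughly {est:.4f} +/- {hw:.4f}")
```

The interval covers $\alpha_k(n)$ only. Turning it into an interval for $\alpha_k$ would additionally require subtracting $Q(n)$ from the lower end, following $\alpha_k(n)-Q(n) \le \alpha_k \le \alpha_k(n)$. The quadratic-time pure-Python distance routine is for illustration and would be far too slow for $n$ in the hundreds of thousands.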

Abstract

The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let $e_k(n)$ denote the average edit distance between random, independent strings of $n$ characters from an alphabet of size $k$. For $k \geq 2$, it is an open problem how to efficiently compute the exact value of $\alpha_k(n) = e_k(n)/n$ as well as of $\alpha_k = \lim_{n \to \infty} \alpha_k(n)$, a limit known to exist. This paper shows that $\alpha_k(n)-Q(n) \leq \alpha_k \leq \alpha_k(n)$, for a specific $Q(n)=\Theta(\sqrt{\log n / n})$, a result which implies that $\alpha_k$ is computable. The exact computation of $\alpha_k(n)$ is explored, leading to an algorithm running in time $T=\mathcal{O}(n^2k\min(3^n,k^n))$, a complexity that makes it of limited practical use. An analysis of statistical estimates is proposed, based on McDiarmid's inequality, showing how $\alpha_k(n)$ can be evaluated with good accuracy, high confidence level, and reasonable computation time, for values of $n$ up to, say, a quarter million. Correspondingly, 99.9\% confidence intervals of width approximately $10^{-2}$ are obtained for $\alpha_k$. Combinatorial arguments on edit scripts are exploited to analytically characterize an efficiently computable lower bound $\beta_k^*$ to $\alpha_k$, such that $\lim_{k \to \infty} \beta_k^*=1$. In general, $\beta_k^* \leq \alpha_k \leq 1-1/k$; for $k$ greater than a few dozen, computing $\beta_k^*$ is much faster than generating good statistical estimates with confidence intervals of width $1-1/k-\beta_k^*$. The techniques developed in the paper yield improvements on most previously published numerical values as well as results for alphabet sizes and string lengths not reported before.
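To make the definition $\alpha_k(n) = e_k(n)/n$ concrete, the following sketch computes it exactly for very small $n$ by enumerating all $k^{2n}$ string pairs. This is plain brute force, not the paper's $\mathcal{O}(n^2k\min(3^n,k^n))$ algorithm, and the function names are hypothetical. Since $\alpha_k \leq \alpha_k(n)$, each value it returns is also a (weak) upper bound on $\alpha_k$.

```python
from itertools import product


def edit_distance(x, y):
    """Levenshtein distance (unit-cost edits) via row-by-row dynamic programming."""
    prev = list(range(len(y) + 1))
    for i, xi in enumerate(x, start=1):
        curr = [i]
        for j, yj in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                 # delete xi
                            curr[j - 1] + 1,             # insert yj
                            prev[j - 1] + (xi != yj)))   # match or substitute
        prev = curr
    return prev[-1]


def exact_alpha(k, n):
    """Exact alpha_k(n) = e_k(n)/n by brute-force enumeration of all pairs.

    Cost grows as k^(2n) * n^2, so this is only usable for tiny k and n.
    """
    strings = list(product(range(k), repeat=n))
    total = sum(edit_distance(x, y) for x in strings for y in strings)
    return total / (len(strings) ** 2 * n)


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, exact_alpha(2, n))
```

For $n=1$ the distance is 0 when the two characters agree and 1 otherwise, so `exact_alpha(k, 1)` equals $1-1/k$, matching the general upper bound $\alpha_k \leq 1-1/k$ quoted above.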

Paper Structure

This paper contains 26 sections, 20 theorems, 67 equations, 1 figure, 9 tables, 1 algorithm.

Introduction
Preliminaries
Notation and definitions
Computing the edit distance
Rate of convergence and computability of $\alpha_k$
Monte Carlo estimates of $\alpha_k$
$\alpha_k(n)$
Remark
$\alpha_k$
Remark
Upper bounds for $\alpha_k$
The coalesced dynamic programming algorithm for eccentricity
Remark
Exploiting symmetries of $\mathrm{ecc}(x)$ in the computation of $e_k(n)$
Lower bounds for $\alpha_k$
...and 11 more sections

Key Result

Lemma 2.1

With the preceding notation, if $\mathcal{S}$ is a simple script to transform $x$ into $y$, with $|x|=|y|=n$, and $(\mathcal{I},\mathcal{J})=a(\mathcal{S})$ is the corresponding alignment, its cost is

Figures (1)

Figure 1: Illustration of the proof of Proposition \ref{prop:G-beta-bound}.

Theorems & Definitions (41)

Lemma 2.1
proof
Definition 3.1: Computability of a real number
Definition 3.2
Proposition 3.1
proof
Theorem 3.2: LMT12
Theorem 3.3
proof
Proposition 4.1: McD89
...and 31 more
