Computable Bounds and Monte Carlo Estimates of the Expected Edit Distance
Gianfranco Bilardi, Michele Schimd
TL;DR
This paper investigates the expected edit distance between two random strings over an alphabet of size $k$ by studying $\alpha_k(n)=e_k(n)/n$ and its limit $\alpha_k$. It proves $\alpha_k(n)-Q(n) \le \alpha_k \le \alpha_k(n)$ with a computable, universal bound $Q(n)=\Theta(\sqrt{\log n / n})$, establishing computability of $\alpha_k$ and framing practical estimation approaches. It then develops two complementary strategies: (i) Monte Carlo estimates of $\alpha_k(n)$ with confidence intervals derived via McDiarmid's inequality, enabling accurate estimates for large $n$; (ii) analytical upper and lower bounds on $\alpha_k$ via a coalesced dynamic programming algorithm (CDP) for eccentricity, and a novel ball-size analysis that yields a computable lower bound $\beta_k^*$ with $\lim_{k\to\infty}\beta_k^*=1$. The combination yields improved numerical bounds for many $k$ and $n$, and the work outlines a conjecture on the asymptotic behavior $\lim_{k\to\infty} (1-\alpha_k)k=c_\alpha$ with $c_\alpha\ge 1$. Together, these results advance the computability and estimation of fundamental constants governing the distance between random strings, with implications for seeding biological sequence analysis and nearest-neighbor search applications.
Abstract
The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let $e_k(n)$ denote the average edit distance between random, independent strings of $n$ characters from an alphabet of size $k$. For $k \geq 2$, it is an open problem how to efficiently compute the exact value of $\alpha_k(n) = e_k(n)/n$, as well as that of $\alpha_k = \lim_{n \to \infty} \alpha_k(n)$, a limit known to exist. This paper shows that $\alpha_k(n)-Q(n) \leq \alpha_k \leq \alpha_k(n)$ for a specific $Q(n)=\Theta(\sqrt{\log n / n})$, a result which implies that $\alpha_k$ is computable. The exact computation of $\alpha_k(n)$ is explored, leading to an algorithm running in time $T=\mathcal{O}(n^2k\min(3^n,k^n))$, a complexity that makes it of limited practical use. An analysis of statistical estimates based on McDiarmid's inequality is proposed, showing how $\alpha_k(n)$ can be evaluated with good accuracy, a high confidence level, and reasonable computation time for values of $n$ up to, say, a quarter million. Correspondingly, 99.9\% confidence intervals of width approximately $10^{-2}$ are obtained for $\alpha_k$. Combinatorial arguments on edit scripts are exploited to analytically characterize an efficiently computable lower bound $\beta_k^*$ on $\alpha_k$, such that $\lim_{k \to \infty} \beta_k^*=1$. In general, $\beta_k^* \leq \alpha_k \leq 1-1/k$; for $k$ greater than a few dozen, computing $\beta_k^*$ is much faster than generating good statistical estimates with confidence intervals of width $1-1/k-\beta_k^*$. The techniques developed in the paper yield improvements on most previously published numerical values, as well as results for alphabet sizes and string lengths not reported before.
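To make the statistical approach concrete, the following Python sketch estimates $\alpha_k(n)$ by sampling and attaches a confidence interval obtained from McDiarmid's inequality. It is an illustration under stated assumptions, not the paper's implementation: the function names (`edit_distance`, `exact_alpha`, `mc_alpha`) and parameters are hypothetical, the brute-force routine is a naive enumeration rather than the exact algorithm analyzed in the paper, and the half-width follows from the standard bounded-differences argument (changing one of the $2nN$ sampled characters changes the average of $N$ distances by at most $1/N$), so the paper's constants may differ. The interval covers $\alpha_k(n)$, not the limit $\alpha_k$.

```python
import math
import random
from itertools import product


def edit_distance(x, y):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(y) + 1))
    for i, xi in enumerate(x, start=1):
        curr = [i]
        for j, yj in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                 # delete xi
                            curr[j - 1] + 1,             # insert yj
                            prev[j - 1] + (xi != yj)))   # match or substitute
        prev = curr
    return prev[-1]


def exact_alpha(k, n):
    """Naive exact alpha_k(n): average over all k^(2n) pairs (tiny n only)."""
    strings = list(product(range(k), repeat=n))
    total = sum(edit_distance(x, y) for x in strings for y in strings)
    return total / (len(strings) ** 2 * n)


def mc_alpha(k, n, num_pairs, delta=1e-3, seed=0):
    """Monte Carlo estimate of alpha_k(n) with a (1 - delta) confidence half-width.

    Assumption (bounded differences): the sample mean of num_pairs distances
    depends on 2 * n * num_pairs independent characters, each influencing it
    by at most 1 / num_pairs. McDiarmid then gives
        P(|estimate - alpha_k(n)| >= h) <= 2 * exp(-num_pairs * n * h**2),
    and solving 2 * exp(-num_pairs * n * h**2) = delta yields h below.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(num_pairs):
        x = [rng.randrange(k) for _ in range(n)]
        y = [rng.randrange(k) for _ in range(n)]
        total += edit_distance(x, y)
    estimate = total / (num_pairs * n)
    half_width = math.sqrt(math.log(2 / delta) / (num_pairs * n))
    return estimate, half_width
```

With `delta=1e-3` the returned interval `estimate ± half_width` has confidence at least 99.9%. Because the half-width scales as $1/\sqrt{Nn}$, longer strings need fewer sample pairs for the same precision, although each quadratic-time distance computation becomes more expensive.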
