Computable Bounds and Monte Carlo Estimates of the Expected Edit Distance

Gianfranco Bilardi, Michele Schimd

TL;DR

This paper investigates the expected edit distance between two random strings over an alphabet of size $k$ by studying $\alpha_k(n)=e_k(n)/n$ and its limit $\alpha_k$. It proves $\alpha_k(n)-Q(n) \le \alpha_k \le \alpha_k(n)$ with a computable, universal bound $Q(n)=\Theta(\sqrt{\log n / n})$, establishing computability of $\alpha_k$ and framing practical estimation approaches. It then develops two complementary strategies: (i) Monte Carlo estimates of $\alpha_k(n)$ with confidence intervals derived via McDiarmid's inequality, enabling accurate estimates for large $n$; (ii) analytical upper and lower bounds on $\alpha_k$ via a coalesced dynamic programming algorithm (CDP) for eccentricity, and a novel ball-size analysis that yields a computable lower bound $\beta_k^*$ with $\lim_{k\to\infty}\beta_k^*=1$. The combination yields improved numerical bounds for many $k$ and $n$, and the work outlines a conjecture on the asymptotic behavior $\lim_{k\to\infty} (1-\alpha_k)k=c_\alpha$ with $c_\alpha\ge 1$. Together, these results advance the computability and estimation of fundamental constants governing the distance between random strings, with implications for seeding biological sequence analysis and nearest-neighbor search applications.

Abstract

The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let $e_k(n)$ denote the average edit distance between random, independent strings of $n$ characters from an alphabet of size $k$. For $k \geq 2$, it is an open problem how to efficiently compute the exact value of $\alpha_k(n) = e_k(n)/n$ as well as of $\alpha_k = \lim_{n \to \infty} \alpha_k(n)$, a limit known to exist. This paper shows that $\alpha_k(n)-Q(n) \leq \alpha_k \leq \alpha_k(n)$, for a specific $Q(n)=\Theta(\sqrt{\log n / n})$, a result which implies that $\alpha_k$ is computable. The exact computation of $\alpha_k(n)$ is explored, leading to an algorithm running in time $T=\mathcal{O}(n^2k\min(3^n,k^n))$, a complexity that makes it of limited practical use. An analysis of statistical estimates is proposed, based on McDiarmid's inequality, showing how $\alpha_k(n)$ can be evaluated with good accuracy, high confidence level, and reasonable computation time, for values of $n$ up to, say, a quarter million. Correspondingly, 99.9\% confidence intervals of width approximately $10^{-2}$ are obtained for $\alpha_k$. Combinatorial arguments on edit scripts are exploited to analytically characterize an efficiently computable lower bound $\beta_k^*$ to $\alpha_k$, such that $\lim_{k \to \infty} \beta_k^*=1$. In general, $\beta_k^* \leq \alpha_k \leq 1-1/k$; for $k$ greater than a few dozen, computing $\beta_k^*$ is much faster than generating good statistical estimates with confidence intervals of width $1-1/k-\beta_k^*$. The techniques developed in the paper yield improvements on most previously published numerical values as well as results for alphabet sizes and string lengths not reported before.
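The Monte Carlo strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`edit_distance`, `estimate_alpha`), the parameters `num_pairs`, `delta`, and `seed`, and the specific constant in the interval are assumptions made for illustration. The sketch assumes unit-cost insertions, deletions, and substitutions. Under that assumption, changing one character of one string changes the edit distance by at most 1. The average of `num_pairs` normalized distances is then a function of $2n \cdot$ `num_pairs` independent characters, each with bounded difference $1/(n \cdot$ `num_pairs`$)$, so McDiarmid's inequality gives a two-sided half-width of $\sqrt{\ln(2/\delta)/(n \cdot \texttt{num\_pairs})}$ at confidence $1-\delta$. The paper's own interval may use different constants.

```python
import math
import random


def edit_distance(x, y):
    """Unit-cost edit (Levenshtein) distance via the classic O(|x||y|) DP."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, xi in enumerate(x, start=1):
        curr = [i]  # deleting the first i characters of x
        for j, yj in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete xi
                curr[j - 1] + 1,           # insert yj
                prev[j - 1] + (xi != yj),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def estimate_alpha(k, n, num_pairs, delta=1e-3, seed=0):
    """Estimate alpha_k(n) from num_pairs random string pairs.

    Returns (estimate, half_width). Assumption: the half-width comes from
    McDiarmid's inequality with bounded differences 1/(n * num_pairs) over
    2 * n * num_pairs independent characters, so that
    P(|estimate - alpha_k(n)| >= half_width) <= delta.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(num_pairs):
        x = [rng.randrange(k) for _ in range(n)]
        y = [rng.randrange(k) for _ in range(n)]
        total += edit_distance(x, y)
    estimate = total / (n * num_pairs)
    half_width = math.sqrt(math.log(2 / delta) / (n * num_pairs))
    return estimate, half_width


if __name__ == "__main__":
    est, hw = estimate_alpha(k=4, n=200, num_pairs=50)
    print(f"alpha_4(200) is in [{est - hw:.4f}, {est + hw:.4f}] with confidence 99.9%")
```

The interval covers $\alpha_k(n)$, not the limit $\alpha_k$. Turning it into an interval for $\alpha_k$ additionally requires the bound $\alpha_k(n)-Q(n) \leq \alpha_k \leq \alpha_k(n)$, with the paper's specific $Q(n)$, which this sketch does not reproduce. The quadratic-time DP is also only practical for moderate $n$.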

Paper Structure (26 sections, 20 theorems, 67 equations, 1 figure, 9 tables, 1 algorithm)


Key Result

Lemma 2.1

With the preceding notation, if $\mathcal{S}$ is a simple script to transform $x$ into $y$, with $|x|=|y|=n$, and $(\mathcal{I},\mathcal{J})=a(\mathcal{S})$ is the corresponding alignment, its cost is

Figures (1)

  • Figure 1: Illustration of the proof of Proposition \ref{prop:G-beta-bound}.

Theorems & Definitions (41)

  • Lemma 2.1
  • proof
  • Definition 3.1: Computability of a real number
  • Definition 3.2
  • Proposition 3.1
  • proof
  • Theorem 3.2: LMT12
  • Theorem 3.3
  • proof
  • Proposition 4.1: McD89
  • ...and 31 more