Almost Linear Size Edit Distance Sketch
Michal Koucký, Michael Saks
TL;DR
This work presents a sketch-and-recover framework for edit distance whose sketch size is almost linear in the threshold parameter $k$, specifically $O\big(k\,2^{O(\sqrt{\log n\log\log n})}\big)$, and whose recovery algorithm outputs the exact edit distance between two strings (or reports LARGE when it exceeds $k$) with high probability. The approach combines an Ostrovsky–Rabani embedding into $\ell_1$, a hierarchical string decomposition with grammars, and a hierarchical mismatch recovery (HMR) mechanism, plus a randomized superposition scheme to cope with misalignment across fragments. The scheme uses a multilevel decomposition tree and location-tree watermarks to recover both the edit operations and their exact positions. Sketching and recovery run in polynomial time, and the failure probability can be reduced by repetition. The result removes the quadratic dependence on $k$ of earlier edit-distance sketches: the dependence of the sketch size on $k$ is information-theoretically optimal, and the recovery algorithm also outputs an optimal sequence of edits when $ED(x,y)\le k$.
Abstract
Edit distance is an important measure of string similarity. It counts the minimum number of insertions, deletions and substitutions one has to make to a string $x$ to get a string $y$. In this paper we design an almost linear-size sketching scheme for computing edit distance up to a given threshold $k$. The scheme consists of two algorithms, a sketching algorithm and a recovery algorithm. The sketching algorithm depends on the parameter $k$, takes as input a string $x$ and a public random string $ρ$, and computes a sketch $sk_ρ(x;k)$, which is a digested version of $x$. The recovery algorithm is given two sketches $sk_ρ(x;k)$ and $sk_ρ(y;k)$ as well as the public random string $ρ$ used to create them. With high probability, if the edit distance $ED(x,y)$ between $x$ and $y$ is at most $k$, it outputs $ED(x,y)$ together with an optimal sequence of edit operations that transforms $x$ into $y$, and if $ED(x,y) > k$, it outputs LARGE. The size of the sketch output by the sketching algorithm on input $x$ is $k\,2^{O(\sqrt{\log n\,\log\log n})}$, where $n$ is an upper bound on the length of $x$. The sketching and recovery algorithms both run in time polynomial in $n$. The dependence of the sketch size on $k$ is information-theoretically optimal and improves over the quadratic dependence on $k$ in the schemes of Kociumaka, Porat and Starikovskaya (FOCS'2021), and Bhattacharya and Koucký (STOC'2023).
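
To make the input/output contract of the recovery algorithm concrete, the following is a minimal Python sketch of the thresholded edit distance it is specified to report: $ED(x,y)$ when this is at most $k$, and LARGE otherwise. It is not the sketching scheme of the paper: it reads both strings in full and uses a standard banded dynamic program. The function name `thresholded_edit_distance` and the sentinel `LARGE` are illustrative assumptions, and the optimal edit sequence is omitted.

```python
LARGE = "LARGE"  # hypothetical sentinel for "edit distance exceeds k"


def thresholded_edit_distance(x: str, y: str, k: int):
    """Return ED(x, y) if it is at most k, else LARGE.

    Standard banded dynamic program over the full strings (not a sketch):
    only cells (i, j) with |i - j| <= k can lie on an alignment of cost <= k.
    """
    n, m = len(x), len(y)
    if abs(n - m) > k:
        return LARGE  # at least |n - m| insertions/deletions are needed
    inf = k + 1  # every value above k is clamped to k + 1
    # prev[j] holds min(ED(x[:i-1], y[:j]), k + 1) for the previous row.
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)  # cells outside the band stay at inf
        lo, hi = max(0, i - k), min(m, i + k)
        if lo == 0:
            cur[0] = i  # delete the first i characters of x (i <= k here)
        for j in range(max(1, lo), hi + 1):
            substitute = prev[j - 1] + (0 if x[i - 1] == y[j - 1] else 1)
            delete = prev[j] + 1      # drop x[i-1]
            insert = cur[j - 1] + 1   # add y[j-1]
            cur[j] = min(substitute, delete, insert, inf)
        prev = cur
    return prev[m] if prev[m] <= k else LARGE
```

For example, `thresholded_edit_distance("kitten", "sitting", 3)` returns `3`, while the same call with `k = 2` returns `LARGE`. The scheme described in the abstract is specified to produce the same answer (with high probability) from the two sketches and the public random string alone.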
