Almost Linear Size Edit Distance Sketch

Michal Koucký, Michael Saks

TL;DR

This work presents a novel sketch-and-recover framework for edit distance that achieves near-linear sketch size in the threshold parameter $k$, specifically $O\big(k\,2^{O(\sqrt{\log n\log\log n})}\big)$, while enabling recovery of the exact edit distance (or reporting LARGE) between two strings with high probability. The approach fuses an Ostrovsky–Rabani embedding into $\ell_1$, a hierarchical string decomposition with grammars, and a hierarchical mismatch recovery (HMR) mechanism, plus a randomized superposition scheme to cope with misalignment across fragments. The scheme uses a multilevel decomposition tree and location-tree watermarks to recover both the edit operations and their exact positions, and it achieves polynomial-time sketching and recovery with a controllable failure probability that can be reduced by repetition. These results improve on the quadratic dependence on $k$ of earlier edit-distance sketches: the dependence of the sketch size on $k$ is information-theoretically optimal, and recovery also outputs an optimal sequence of edits when $ED(x,y)\le k$.
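
To see why this bound counts as almost linear in $k$, the factor multiplying $k$ can be rewritten as a power of $n$. The following is a standard calculation rather than a statement from the paper; the constant $c>0$ stands for whatever constant is hidden in the inner $O(\cdot)$, and logarithms are taken base 2.

$$2^{c\sqrt{\log n\,\log\log n}} \;=\; 2^{\log n\cdot c\sqrt{\frac{\log\log n}{\log n}}} \;=\; n^{\,c\sqrt{\log\log n/\log n}} \;=\; n^{o(1)}$$

Since $\sqrt{\log\log n/\log n}\to 0$ as $n\to\infty$, the sketch size is at most $k\cdot n^{o(1)}$: the overhead factor is eventually smaller than any fixed polynomial $n^{\varepsilon}$.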

Abstract

Edit distance is an important measure of string similarity. It counts the number of insertions, deletions and substitutions one has to make to a string $x$ to get a string $y$. In this paper we design an almost linear-size sketching scheme for computing edit distance up to a given threshold $k$. The scheme consists of two algorithms, a sketching algorithm and a recovery algorithm. The sketching algorithm depends on the parameter $k$, takes as input a string $x$ and a public random string $\rho$, and computes a sketch $sk_\rho(x;k)$, which is a digested version of $x$. The recovery algorithm is given two sketches $sk_\rho(x;k)$ and $sk_\rho(y;k)$ as well as the public random string $\rho$ used to create the two sketches. With high probability, if the edit distance $ED(x,y)$ between $x$ and $y$ is at most $k$, it will output $ED(x,y)$ together with an optimal sequence of edit operations that transforms $x$ to $y$, and if $ED(x,y) > k$ it will output LARGE. The size of the sketch output by the sketching algorithm on input $x$ is $k\,2^{O(\sqrt{\log(n)\log\log(n)})}$ (where $n$ is an upper bound on the length of $x$). The sketching and recovery algorithms both run in time polynomial in $n$. The dependence of sketch size on $k$ is information theoretically optimal and improves over the quadratic dependence on $k$ in schemes of Kociumaka, Porat and Starikovskaya (FOCS'2021), and Bhattacharya and Koucký (STOC'2023).
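
The output contract described above (the exact distance plus an optimal edit sequence when $ED(x,y)\le k$, and LARGE otherwise) can be illustrated with a direct computation. The sketch below is not the paper's scheme: it assumes both strings are available in full and uses the textbook quadratic-time dynamic program, with no sketches or public randomness. The names `bounded_edit_distance` and `apply_edits` and the edit-tuple format are hypothetical, chosen only for this illustration.

```python
from typing import List, Tuple, Union

# (operation, position in x, character); a format assumed for this example only.
Edit = Tuple[str, int, str]


def bounded_edit_distance(x: str, y: str, k: int) -> Union[str, Tuple[int, List[Edit]]]:
    """Return (ED(x, y), optimal edit script) if ED(x, y) <= k, else "LARGE"."""
    n, m = len(x), len(y)
    # dist[i][j] is the edit distance between the prefixes x[:i] and y[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete x[i-1]
                dist[i][j - 1] + 1,         # insert y[j-1]
                dist[i - 1][j - 1] + cost,  # match or substitute
            )
    if dist[n][m] > k:
        return "LARGE"

    # Walk back from (n, m) to (0, 0) to read off one optimal edit script.
    edits: List[Edit] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and x[i - 1] == y[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            edits.append(("substitute", i - 1, y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            edits.append(("delete", i - 1, x[i - 1]))
            i -= 1
        else:
            edits.append(("insert", i, y[j - 1]))
            j -= 1
    edits.reverse()
    return dist[n][m], edits


def apply_edits(x: str, edits: List[Edit]) -> str:
    """Apply an edit script to x, right to left so positions in x stay valid."""
    chars = list(x)
    for op, pos, ch in reversed(edits):
        if op == "insert":
            chars.insert(pos, ch)
        elif op == "delete":
            del chars[pos]
        else:  # substitute
            chars[pos] = ch
    return "".join(chars)


if __name__ == "__main__":
    print(bounded_edit_distance("kitten", "sitting", 2))
    result = bounded_edit_distance("kitten", "sitting", 3)
    print(result)
    if result != "LARGE":
        print(apply_edits("kitten", result[1]))
```

With threshold 2 the call reports LARGE, because "kitten" and "sitting" are at distance 3; with threshold 3 it returns the distance together with a three-operation script, and `apply_edits` confirms that the script turns the first string into the second. The point of the paper's scheme is to produce the same kind of answer from two short sketches instead of the full strings.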

Paper Structure

This paper contains 35 sections, 39 theorems, 39 equations, and 10 algorithms.

Key Result

Theorem 1.1

There is a randomized sketching algorithm $\textsc{ED-sketch}$ that, on an input string $x$ of length at most $n$ with parameter $k<n$ and using a public random string $\rho$, produces a sketch $\textsc{sk}_{\rho}(x)$ of size $O(k 2^{O(\sqrt{\log(n)\log\log(n)})})$, and a recovery algorithm $\textsc{ED-}$...

Theorems & Definitions (69)

  • Theorem 1.1: Sketch for edit distance
  • Proposition 3.1
  • Proposition 3.2
  • proof
  • Proposition 3.3
  • Proposition 3.4
  • Proposition 3.5
  • proof
  • Proposition 3.6: Dietzfelbinger [Dietzfelbinger96]
  • Lemma 3.7
  • ...and 59 more