Almost Linear Size Edit Distance Sketch

Michal Koucký; Michael Saks

TL;DR

This work presents a sketch-and-recover scheme for edit distance whose sketch size is nearly linear in the threshold parameter $k$, specifically $O\big(k\,2^{O(\sqrt{\log n\log\log n})}\big)$. From two sketches, the recovery algorithm outputs the exact edit distance with high probability, or reports LARGE when the distance exceeds $k$. The approach combines an Ostrovsky–Rabani embedding into $\ell_1$, a hierarchical string decomposition with grammars, and a hierarchical mismatch recovery (HMR) mechanism, plus a randomized superposition scheme to cope with misalignment across fragments. The scheme uses a multilevel decomposition tree and location-tree watermarks to recover both the edit operations and their exact positions; sketching and recovery run in polynomial time, and the failure probability can be reduced by repetition. The result improves on the quadratic dependence on $k$ of earlier edit-distance sketches: the dependence on $k$ is information-theoretically optimal, and the scheme also outputs an optimal sequence of edits when $ED(x,y)\le k$.
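To make the phrase "near-linear" concrete, the following short derivation shows why the extra factor in the sketch size is subpolynomial in $n$ while still growing faster than any fixed power of $\log n$. It is a standard calculation rather than a statement taken from the paper; the constant $c$ standing in for the hidden constant of the $O(\cdot)$ and the use of base-2 logarithms are assumptions made for illustration.

```latex
% Assumption: c > 0 is the constant hidden in the O(.), logarithms are base 2.
\[
  \sqrt{\log n \,\log\log n} \;=\; \log n \cdot \sqrt{\frac{\log\log n}{\log n}},
  \qquad\text{so}\qquad
  2^{\,c\sqrt{\log n\,\log\log n}} \;=\; n^{\,c\sqrt{\log\log n/\log n}} \;=\; n^{o(1)} .
\]
% For comparison, a polylogarithmic factor is (\log n)^{c'} = 2^{c' \log\log n}, and
\[
  \frac{\sqrt{\log n\,\log\log n}}{\log\log n} \;=\; \sqrt{\frac{\log n}{\log\log n}} \;\longrightarrow\; \infty ,
\]
% so the factor eventually exceeds every fixed power of \log n.
```

Under these assumptions the sketch size is $k\cdot n^{o(1)}$: linear in $k$ up to a factor that sits between polylogarithmic and polynomial in $n$.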

Abstract

Edit distance is an important measure of string similarity. It counts the minimum number of insertions, deletions and substitutions one has to make to a string $x$ to get a string $y$. In this paper we design an almost linear-size sketching scheme for computing edit distance up to a given threshold $k$. The scheme consists of two algorithms, a sketching algorithm and a recovery algorithm. The sketching algorithm depends on the parameter $k$, takes as input a string $x$ and a public random string $\rho$, and computes a sketch $sk_\rho(x;k)$, which is a digested version of $x$. The recovery algorithm is given two sketches $sk_\rho(x;k)$ and $sk_\rho(y;k)$ as well as the public random string $\rho$ used to create them. With high probability, if the edit distance $ED(x,y)$ between $x$ and $y$ is at most $k$, it outputs $ED(x,y)$ together with an optimal sequence of edit operations that transforms $x$ into $y$; if $ED(x,y) > k$, it outputs LARGE. The size of the sketch output by the sketching algorithm on input $x$ is $k\,2^{O(\sqrt{\log n\log\log n})}$, where $n$ is an upper bound on the length of $x$. The sketching and recovery algorithms both run in time polynomial in $n$. The dependence of sketch size on $k$ is information-theoretically optimal and improves over the quadratic dependence on $k$ in the schemes of Kociumaka, Porat and Starikovskaya (FOCS'2021), and Bhattacharya and Koucký (STOC'2023).
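The output contract described above (the exact distance plus an optimal edit sequence when $ED(x,y)\le k$, and LARGE otherwise) can be illustrated with a minimal Python sketch. This is the classical quadratic-time dynamic program run on the full strings, not the paper's sketching scheme; the function name `threshold_edit_distance` and the `(operation, position, character)` edit encoding are hypothetical choices made for this example.

```python
from typing import List, Tuple, Union

# Hypothetical edit encoding: (operation, position in x, character involved).
Edit = Tuple[str, int, str]


def threshold_edit_distance(x: str, y: str, k: int) -> Union[str, Tuple[int, List[Edit]]]:
    """Return (ED(x, y), optimal edits) if ED(x, y) <= k, else the string "LARGE".

    Textbook dynamic program over the full strings (no sketching involved).
    """
    n, m = len(x), len(y)
    # dp[i][j] = edit distance between the prefixes x[:i] and y[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i  # delete all i characters
    for j in range(m + 1):
        dp[0][j] = j  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match or substitution
    if dp[n][m] > k:
        return "LARGE"

    # Walk back from (n, m) to (0, 0) to read off one optimal edit sequence.
    edits: List[Edit] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and x[i - 1] == y[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                       # characters match, no edit
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            edits.append(("substitute", i - 1, y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            edits.append(("delete", i - 1, x[i - 1]))
            i -= 1
        else:
            edits.append(("insert", i, y[j - 1]))
            j -= 1
    edits.reverse()
    return dp[n][m], edits
```

For instance, `threshold_edit_distance("kitten", "sitting", 3)` reports distance 3 with three edits, while the same call with `k = 2` returns `"LARGE"`. The point of the paper's scheme is to produce this kind of answer from two short sketches instead of the full strings.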

Paper Structure (35 sections, 39 theorems, 39 equations, 10 algorithms)

Introduction
Our technique
Preliminaries
Strings, sequences, trees, and string decomposition
Substrings and fragments
Sequences and Hamming Distance
Trees
String decompositions, and tree decompositions
Edit distance and its representation in grid graphs
Representing edit distance by paths in weighted grids
Computational Considerations
Function families and randomized functions
Three auxiliary procedures
Fingerprinting
Threshold edit distance fingerprinting
...and 20 more sections

Key Result

Theorem 1.1

There is a randomized sketching algorithm $\textsc{ED-sketch}$ that on an input string $x$ of length at most $n$ with parameter $k<n$ and using a public random string $\rho$ produces a sketch $\textsc{sk}_{\rho}(x)$ of size $O(k\,2^{O(\sqrt{\log n\log\log n})})$, and a recovery algorithm $\textsc{ED-}\ldots$

Theorems & Definitions (69)

Theorem 1.1: Sketch for edit distance
Proposition 3.1
Proposition 3.2
proof
Proposition 3.3
Proposition 3.4
Proposition 3.5
proof
Proposition 3.6: Dietzfelbinger [Dietzfelbinger96]
Lemma 3.7
...and 59 more
