Many Flavors of Edit Distance

Sudatta Bhattacharya, Sanjana Dey, Elazar Goldenberg, Michal Koucký

TL;DR

This paper shows how to reduce string-similarity questions over arbitrary alphabets to equivalent questions over a binary alphabet, and how to transform questions about indel distance into equivalent questions about edit distance.

Abstract

Several measures exist for string similarity, including notable ones like the edit distance and the indel distance. The former measures the count of insertions, deletions, and substitutions required to transform one string into another, while the latter specifically quantifies the number of insertions and deletions. Many algorithmic solutions explicitly address one of these measures, and frequently techniques applicable to one can also be adapted to work with the other. In this paper, we investigate whether there exists a standardized approach for applying results from one setting to another. Specifically, we demonstrate the capability to reduce questions regarding string similarity over arbitrary alphabets to equivalent questions over a binary alphabet. Furthermore, we illustrate how to transform questions concerning indel distance into equivalent questions based on edit distance. This complements an earlier result of Tiskin (2007) which addresses the inverse direction.
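
To make the two measures concrete, the following is a minimal sketch of the textbook dynamic programs for both distances. It is not taken from the paper: the function names `edit_distance` and `indel_distance` are illustrative assumptions, and the code only illustrates the definitions in the abstract, not the paper's embeddings.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning x into y."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


def indel_distance(x: str, y: str) -> int:
    """Minimum number of insertions and deletions turning x into y (no substitutions)."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            best = min(prev[j] + 1, curr[j - 1] + 1)  # delete a, or insert b
            if a == b:
                best = min(best, prev[j - 1])         # match at no cost
            curr.append(best)
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    for x, y in [("kitten", "sitting"), ("ab", "ba"), ("a", "b")]:
        print(x, y, edit_distance(x, y), indel_distance(x, y))
```

Since a substitution can always be simulated by one deletion followed by one insertion, the two values satisfy $\Delta_{edit}(X,Y) \le \Delta_{indel}(X,Y) \le 2\,\Delta_{edit}(X,Y)$ for every pair of strings, which the sketch can be used to spot-check on small inputs.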

Paper Structure

This paper contains 24 sections, 14 theorems, 24 equations, 5 figures, 2 algorithms.

Key Result

Theorem 1.1

Let $\Gamma$ be a finite alphabet, and let $0<\varepsilon<1/4$. There exists an alphabet $\Sigma$, where $\lvert \Sigma \rvert = O (\frac{1}{\varepsilon^2})$ and there exists $E:\Gamma^*\to \Sigma^*$ satisfying: Moreover, for every $X\in \Gamma^n$ we have: $\lvert E(X) \rvert=O(n\log (\lvert \Gamma \rvert))$.

Figures (5)

  • Figure 1: Example for (a) $\Delta_{indel}$ alignment and (b) $\Delta_{edit}$ alignment (the matched characters are highlighted in blue, the deleted characters in red and the substituted characters in orange).
  • Figure 2: An illustration of the matching between the strings $E(X)$ and $E(Y)$. Arrows indicate matching coordinates, and dashed lines represent the beginning/end points of the segments. The first segment starts at $(1,1)$ and ends at $(2,3)$, as the first matched coordinate in the second block of $E(X)$ is mapped to the third block, and no character from the first block of $E(X)$ is mapped to that block. The second segment starts at $(3,1)$ and ends at $(4,1)$ (as the first matched coordinate in the second block of $E(X)$ is mapped to the third block, and there exists a character from the second block of $E(X)$ that is mapped to that block).
  • Figure 3: Block $i'$ partially matches blocks $j_1, j_2, j_3$ in $E(Y)$. Consequently, no other block in $E(X)$ can be partially matched with $j_2$. The red line illustrates the prohibited matching.
  • Figure 4: On the left side, we have the decomposition of $X$ and $Y$ based on the $\Delta_{indel}$ alignment. On the right side, we see the decomposition and alignment of $X$ and $Y'$ following our construction in Section \ref{sec:indel-edit-exact}. The solid red arrows indicate that all characters of $m^{X}_i$ are matched, while the dotted red arrows suggest that the characters of $m^{X}_i$ are substituted. The shaded cells in gray indicate deletions. On the right-hand side, the blue dotted lines indicate the alignment of $m^{X}_i$ with $m^{Y'}_i$.
  • Figure 5: On the left side, we have the decomposition of two strings $X$ and $Y$ of length 20 each based on the $\Delta_{indel}$ alignment. On the right side, we see the decomposition and alignment of $E_1(X)=X$ and $E_3(Y)=\widetilde{Y}$ as constructed by our Algorithm \ref{alg:indel-edit-apx}, where $k=1$. Matching between characters is indicated by solid red arrows, while substitutions are denoted by dotted red arrows. The deleted cells or characters are shaded in grey while the substituted cells are shaded in green.

Theorems & Definitions (31)

  • Theorem 1.1: Alphabet Reduction - Succinct Embedding
  • Corollary 1.1
  • Theorem 1.2: Informal statement of Theorem \ref{thm:formulas}
  • Theorem 1.3: Indel Into Edit Metrics Embedding - Approximate embedding - Statement of Theorem \ref{thm:indel-edit-apx}
  • Definition 1: A Robust Notion of Approximation
  • Claim 3.1
  • Theorem 3.1: Alphabet Reduction - Succinct Embedding
  • Claim 3.2
  • Lemma 3.3
  • Lemma 3.4
  • ...and 21 more