Levenshtein Distance Embedding with Poisson Regression for DNA Storage

Xiang Wei, Alan J. X. Guo, Sihan Sun, Mengyi Wei, Wei Yu

TL;DR

This work addresses efficient approximation of the Levenshtein distance by learning a neural sequence embedding trained with Poisson regression. It develops a Siamese embedding framework, analyzes how the embedding dimension controls the variance of the approximation, and introduces the early-stopping dimension (ESD) as a criterion for capping the embedding dimension. The Poisson negative log-likelihood (PNLL) objective aligns naturally with the definition of the distance and approximates the negative log-likelihood of a chi-squared distribution, improving on prior losses, especially for homologous sequence pairs in DNA storage data. Experiments show that PNLL combined with the ESD outperforms state-of-the-art methods across multiple embedding architectures on real DNA storage data.

Abstract

Efficient computation or approximation of the Levenshtein distance, a widely used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps the Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed. We first provide a theoretical analysis of the impact of the embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, Poisson regression is introduced by assuming that the Levenshtein distance between sequences of fixed length follows a Poisson distribution, which naturally aligns with the definition of the Levenshtein distance. Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log-likelihood of the chi-squared distribution and helps remove the skewness. Through comprehensive experiments on real DNA storage data, we demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.
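To make the Poisson-regression idea concrete, here is a minimal sketch of a Poisson negative log-likelihood over pairs of embedding vectors. The abstract does not spell out the exact loss, so this sketch assumes that the squared Euclidean distance between two embeddings plays the role of the Poisson rate. The names `squared_distance` and `poisson_nll` and the `eps` stabilizer are hypothetical and are not taken from the paper.

```python
from math import lgamma, log


def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def poisson_nll(pairs, distances, eps=1e-8):
    """Mean Poisson negative log-likelihood over embedding pairs.

    pairs     : list of (u, v) embedding vectors (lists of floats)
    distances : ground-truth Levenshtein distances (non-negative integers)

    Assumption (not confirmed by the source): the Poisson rate is the
    squared Euclidean distance between the two embeddings.
    """
    total = 0.0
    for (u, v), d in zip(pairs, distances):
        rate = squared_distance(u, v)
        # -log P(d | rate) = rate - d * log(rate) + log(d!)
        # eps guards against log(0) when the two embeddings coincide.
        total += rate - d * log(rate + eps) + lgamma(d + 1.0)
    return total / len(distances)
```

For a fixed target distance `d`, the per-pair term `rate - d * log(rate)` has derivative `1 - d / rate`, so it is minimized when the rate equals `d`. Under the stated assumption, training therefore pushes the squared embedding distance toward the true Levenshtein distance.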

Paper Structure

This paper contains 20 sections, 1 theorem, 23 equations, 12 figures, 2 tables.

Key Result

Theorem 1.1

Given two sequences of length $n$, the Levenshtein distance cannot be computed in time $O(n^{2-\delta})$ for any $\delta > 0$ unless the Strong Exponential Time Hypothesis is false.
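For reference, the quadratic baseline that this lower bound refers to is the classic dynamic-programming algorithm (Wagner-Fischer). The sketch below is a standard textbook implementation, not code from the paper. The function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Exact Levenshtein distance in O(len(a) * len(b)) time."""
    if len(a) < len(b):
        a, b = b, a  # keep the rolling row as short as possible
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept, so memory is linear in the shorter sequence while the running time stays quadratic. This cost is what motivates approximating the distance with an embedding when many sequence pairs must be compared.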

Figures (12)

  • Figure 1: The sorted eigenvalues of $\mathrm{cov}(\bm{u_i}-\bm{u_j},\bm{u_i}-\bm{u_j})$, $i\neq j$, plotted for different choices of the embedding dimension $n$ in the CNN-$5$ embedding network. When $n$ is small, the eigenvalues are distributed around $1$, as in (a)--(d). As $n$ increases, the sorted eigenvalues decay to $0$ beyond some dimension, as in (e)--(h).
  • Figure 2: The global approximation error and the homologous approximation error plotted against the embedding dimension in (a) and (b), respectively. The curves show the mean and standard deviation over 5 runs. The approximation errors decrease as the embedding dimension $n$ increases, up to the ESD $n_0$, which is $120$ for CNN-$5$. For $n>n_0$, a larger $n$ brings no further performance gain, and the model becomes unstable, with a larger standard deviation.
  • Figure 3: CNN-$10$. The sorted eigenvalues of $\mathrm{cov}(\bm{u_i}-\bm{u_j},\bm{u_i}-\bm{u_j})$, $i\neq j$, plotted for different choices of the embedding dimension $n$ in the CNN-$10$ embedding network. When $n$ is small, the eigenvalues are distributed around $1$, as in (a)--(c). As $n$ increases, the sorted eigenvalues decay to $0$ beyond some dimension, as in (d)--(f).
  • Figure 4: CNN-$10$. The AE$_g$ and AE$_h$ plotted against the embedding dimension in (a) and (b), respectively. The curves show the mean and standard deviation over 5 runs. The approximation errors decrease as the embedding dimension $n$ increases, up to the ESD $n_0$, which is $120$ for CNN-$10$. For $n>n_0$, a larger $n$ brings no further performance gain.
  • Figure 5: CNN-$5$-w. The sorted eigenvalues of $\mathrm{cov}(\bm{u_i}-\bm{u_j},\bm{u_i}-\bm{u_j})$, $i\neq j$, plotted for different choices of the embedding dimension $n$ in the CNN-$5$-w embedding network. When $n$ is small, the eigenvalues are distributed around $1$, as in (a)--(c). As $n$ increases, the sorted eigenvalues decay to $0$ beyond some dimension, as in (d)--(f).
  • ...and 7 more figures

Theorems & Definitions (1)

  • Theorem 1.1