Levenshtein Distance Embedding with Poisson Regression for DNA Storage
Xiang Wei, Alan J. X. Guo, Sihan Sun, Mengyi Wei, Wei Yu
TL;DR
This work addresses efficient approximation of the Levenshtein distance by learning a neural sequence embedding with Poisson regression. It develops a Siamese embedding framework, analyzes how the embedding dimension controls approximation variance, and introduces the early-stopping dimension (ESD) as a criterion that bounds dimension growth. The Poisson negative log-likelihood objective (PNLL) aligns naturally with the definition of the distance and asymptotically matches a chi-squared likelihood, improving on prior losses, especially for homologous sequence pairs in DNA storage data. Experiments show that PNLL with ESD outperforms state-of-the-art methods across multiple architectures, which makes the approach relevant to large-scale sequence similarity tasks in DNA storage and related domains.
Abstract
Efficient computation or approximation of the Levenshtein distance, a widely used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps the Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed. We first provide a theoretical analysis of the impact of the embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, Poisson regression is introduced by assuming that the Levenshtein distance between sequences of fixed length follows a Poisson distribution, an assumption that aligns naturally with the definition of the Levenshtein distance. Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log-likelihood of the chi-squared distribution and helps remove the skewness. Through comprehensive experiments on real DNA storage data, we demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.
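To make the idea in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It computes the exact Levenshtein distance by dynamic programming and evaluates a Poisson negative log-likelihood in which the distance between two embedding vectors acts as the predicted Poisson rate. The use of the squared Euclidean distance as that rate is an assumption made here for illustration (the abstract only says "a conventional distance"), and the function names and the example embedding vectors are hypothetical.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Exact edit distance (insertions, deletions, substitutions) by row-wise DP."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def squared_euclidean(u, v):
    """Squared Euclidean distance; assumed here to play the role of the Poisson rate."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def poisson_nll(rate: float, target: int, eps: float = 1e-8) -> float:
    """Poisson negative log-likelihood, dropping the constant log(target!) term."""
    return rate - target * math.log(rate + eps)


# Hypothetical embeddings, standing in for the outputs of a trained Siamese network.
u = [0.5, 1.0, -0.5, 0.0]
v = [0.0, 0.0, -0.5, 0.5]

d_true = levenshtein("ACGTAC", "AGTACC")  # ground-truth label for the pair
d_pred = squared_euclidean(u, v)          # embedding-space prediction
loss = poisson_nll(d_pred, d_true)
print(d_true, d_pred, loss)
```

For a fixed target distance, this loss is minimized when the predicted rate equals the target, which is the property that lets the embedding distance be trained to track the Levenshtein distance.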
