Error-Correcting Codes for Labeled DNA Sequences

Dganit Hanania, Eitan Yaakobi

TL;DR

The paper tackles recovering the original labeling sequence on DNA molecules from noisy readings that may include deletions, insertions, or substitutions. It extends the labeling capacity framework to error-prone channels by developing two regimes: (i) using the full set of length-$2$ labels and (ii) using a minimal length-$2$ label set that still achieves maximal capacity ($\phi(q)$ labels). It provides both concrete bounds and explicit encoders: a derivative-based construction with lower bounds and sphere-packing-type upper bounds for the all-labels case, a Tenengolts-based systematic encoder for the minimal-label-set case, and a substitution-correcting scheme via Hamming-code cosets with a second explicit encoder. Collectively, these results enable efficient, linear-time encoding and decoding for robust labeling of DNA sequences under single deletion/insertion and single substitution errors, with applicability to general alphabets beyond DNA.

Abstract

Labeling of DNA molecules is a fundamental technique for DNA visualization and analysis. This process was mathematically modeled in [1], where the received sequence indicates the positions of the used labels. In this work, we develop error-correcting codes for labeled DNA sequences, establishing bounds and constructing explicit systematic encoders for single substitution, insertion, and deletion errors. We focus on two cases: (1) using the complete set of length-two labels and (2) using the minimal set of length-two labels that ensures the recovery of DNA sequences from their labeling for 'almost' all DNA sequences.
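
To make single-deletion correction concrete, here is a minimal sketch of the classical binary Varshamov-Tenengolts (VT) code, the family that Tenengolts-style constructions build on. It is not the paper's encoder, which works on labeling sequences over larger alphabets. The function names `vt_syndrome` and `vt_correct_deletion` are hypothetical and chosen only for this example.

```python
def vt_syndrome(bits):
    """Weighted checksum sum(i * x_i) mod (n + 1), with positions i starting at 1."""
    return sum(i * b for i, b in enumerate(bits, start=1)) % (len(bits) + 1)


def vt_correct_deletion(y, a=0):
    """Recover the length-n word x with vt_syndrome(x) == a from y,
    where y is x with exactly one bit deleted (len(y) == n - 1)."""
    n = len(y) + 1
    weight = sum(y)
    # How much the checksum dropped because of the deletion, modulo n + 1.
    deficiency = (a - sum(i * b for i, b in enumerate(y, start=1))) % (n + 1)

    if deficiency <= weight:
        # A 0 was deleted: reinsert it so that exactly `deficiency` ones lie to its right.
        pos, ones = len(y), 0
        while ones < deficiency:
            pos -= 1
            ones += y[pos]
        return y[:pos] + [0] + y[pos:]

    # A 1 was deleted: reinsert it so that exactly
    # `deficiency - weight - 1` zeros lie to its left.
    zeros_needed = deficiency - weight - 1
    pos, zeros = 0, 0
    while zeros < zeros_needed:
        zeros += 1 - y[pos]
        pos += 1
    return y[:pos] + [1] + y[pos:]
```

The decoder needs only the received word and the agreed syndrome `a`, and it runs in time linear in the word length. The exact position of the deletion inside a run of equal bits cannot be determined, but every position in that run yields the same corrected word.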

Paper Structure

This paper contains 6 sections, 11 theorems, 33 equations.

Key Result

Theorem 1

(hanania2024capacityjournal) The minimal number of labels of length two over $\Sigma_q$ required to achieve the full labeling capacity is $\phi(q) := q^2 - n(q)$, where

Theorems & Definitions (27)

  • Definition 1
  • Example 1
  • Theorem 1
  • Definition 2
  • Theorem 2
  • Remark 1
  • Remark 2
  • Definition 3
  • Definition 4
  • Claim 1
  • ...and 17 more