Protecting the Future of Information: LOCO Coding With Error Detection for DNA Data Storage

Canberk İrimağzı; Yusuf Uslan; Ahmed Hareedy

TL;DR

This work introduces D-LOCO codes, constrained codes defined over the DNA alphabet $\{A, T, G, C\}$ that forbid long runs of identical symbols to mitigate storage errors and achieve GC-content balance. It provides a simple, scalable encoding-decoding rule that maps between messages and codewords without lookups, and proves capacity-achieving performance through a recurrence for code cardinality and a spectral analysis of the associated finite-state transition diagram. To increase reliability, four bridging schemes are proposed, three of which enable single-substitution error detection while maintaining low bandwidth overhead; balancing algorithms ensure near-50% GC-content with minimal rate loss. The paper also analyzes error-detection performance, finite-length rates, and implementation complexity, showing that D-LOCO codes are reconfigurable, parallelizable, and suitable for practical DNA data-storage systems with high rates and robust error detection. Overall, D-LOCO codes offer a practical, capacity-achieving solution for DNA storage that combines low-complexity encoding/decoding, error detection, and flexible balancing.

Abstract

DNA strands serve as a storage medium for $4$-ary data over the alphabet $\{A,T,G,C\}$. DNA data storage promises formidable information density, long-term durability, and ease of replication. However, information stored in this technology might be corrupted. Experiments have revealed that DNA sequences with long homopolymers and/or with low $GC$-content are notably more subject to errors upon storage. This paper investigates the use of the recently introduced method for designing lexicographically-ordered constrained (LOCO) codes in DNA data storage. It introduces DNA LOCO (D-LOCO) codes over the alphabet $\{A,T,G,C\}$ with limited runs of identical symbols. These codes come with an encoding-decoding rule we derive, which provides affordable encoding-decoding algorithms. In terms of storage overhead, the proposed encoding-decoding algorithms outperform those in the existing literature. Our algorithms are readily reconfigurable. D-LOCO codes are intrinsically balanced, which allows us to achieve balancing over the entire DNA strand with minimal rate penalty. Moreover, we propose four schemes to bridge consecutive codewords, three of which guarantee single-substitution error detection per codeword. We examine the probability of not detecting errors. We also show that D-LOCO codes are capacity-achieving and that they offer remarkably high rates at moderate lengths.
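The two constraints named in the abstract, limited homopolymer runs and balanced $GC$-content, can be checked on any candidate strand. The following is a minimal sketch, not code from the paper; the function names and the default thresholds (`max_run=3`, `gc_tolerance=0.1`) are hypothetical choices made for illustration.

```python
from itertools import groupby


def longest_run(strand: str) -> int:
    """Return the length of the longest run of identical symbols."""
    return max((len(list(group)) for _, group in groupby(strand)), default=0)


def gc_content(strand: str) -> float:
    """Return the fraction of symbols in the strand that are G or C."""
    if not strand:
        return 0.0
    return sum(symbol in "GC" for symbol in strand) / len(strand)


def satisfies_constraints(strand: str, max_run: int = 3,
                          gc_tolerance: float = 0.1) -> bool:
    """Check a strand against both constraints.

    max_run and gc_tolerance are illustrative assumptions, not values
    prescribed by the paper.
    """
    run_ok = longest_run(strand) <= max_run
    balance_ok = abs(gc_content(strand) - 0.5) <= gc_tolerance
    return run_ok and balance_ok
```

For instance, `satisfies_constraints("AAAATGC")` fails under these defaults because the strand contains a run of four `A` symbols.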

Paper Structure (22 sections, 6 theorems, 56 equations, 3 figures, 3 tables, 2 algorithms)

Introduction
Some Related Works
Our Contribution and Organization of the Paper
Definition and Cardinality
D-LOCO Encoding-Decoding Rule
Encoding-Decoding Rule for $\ell=3$
Encoding-Decoding Rule for General $\ell$
Bridging and Error Detection
Bridging Scheme I
Bridging Scheme II
Bridging Scheme III
Probability of Not Detecting Errors
Algorithms and Balancing the DNA Sequence
Balancing the D-LOCO Codes of Odd Lengths
Achievable Rates for $\ell=3$ and Literature Comparison
...and 7 more sections

Key Result

Proposition 1

(immink_cai) The cardinality $N(m)$ of the D-LOCO code $\mathcal{D}_{m,\ell}$, where $\ell \geq 1$, satisfies the following recursive relation for $m \geq \ell$: For $0 \leq m \leq \ell$,
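To illustrate how such a cardinality can be computed, the sketch below counts length-$m$ sequences over $\{A,T,G,C\}$ in two ways: by exhaustive enumeration and by a standard counting recurrence for run-limited sequences. It assumes that the constraint means every run of identical symbols has length at most $\ell$. The recurrence used here is the textbook argument (condition on the length of the first run) and is not quoted from the proposition, whose equations are not preserved in this text; the function names are hypothetical.

```python
from itertools import groupby, product

ALPHABET = "ATGC"


def count_brute_force(m: int, ell: int) -> int:
    """Count length-m sequences whose runs all have length at most ell."""
    return sum(
        1
        for word in product(ALPHABET, repeat=m)
        if all(len(list(group)) <= ell for _, group in groupby(word))
    )


def count_by_recurrence(m: int, ell: int) -> int:
    """Count the same sequences with a linear recurrence (m >= 1, ell >= 1).

    For k <= ell no sequence can violate the constraint, so the count is 4**k.
    For k > ell, the first run has some length j in 1..ell, and the remaining
    k - j symbols form a valid sequence starting with a different symbol,
    which gives 3 * count(k - j) choices for each j.
    """
    if m < 1 or ell < 1:
        raise ValueError("m and ell must be positive integers")
    counts = [0] * (m + 1)
    for k in range(1, m + 1):
        if k <= ell:
            counts[k] = 4 ** k
        else:
            counts[k] = 3 * sum(counts[k - j] for j in range(1, ell + 1))
    return counts[m]
```

The enumeration is only practical for small $m$, but it serves as an independent check on the recurrence, which scales linearly in $m$.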

Figures (3)

Figure 1: Upper bounds on the probability of no-detection for $\mathcal{D}_{3m',3}$ and Bridging Scheme III versus $\mathcal{D}_{m,3}$ and Bridging Scheme II-B, $m'=13$ and $m=21$.
Figure 2: Upper bounds on the probability of no-detection for $\mathcal{D}_{3m',3}$ and Bridging Scheme III versus $\mathcal{D}_{m,3}$ and Bridging Scheme II-B, $m'=21$ and $m=33$.
Figure 3: An FSTD of $\mathcal{F}$-constrained sequences, where $\Lambda' \in \{A,T,G,C\} \setminus \{\Lambda\}$. Note that $\Lambda$ represents the last generated symbol at any state. Upon entering $F_1$, $\Lambda$ is always updated to become the relevant $\Lambda'$, and $\Lambda'$ symbols on different FSTD edges are not necessarily the same.

Theorems & Definitions (26)

Definition 1
Example 1
Proposition 1
Remark 1
Theorem 1
Remark 2
Example 2
Theorem 2
Example 3
Remark 3
...and 16 more
