DNA digital data storage and retrieval using algebraic codes
NallappaBhavithran G, Selvakumar R
TL;DR
The paper addresses indel errors and secondary-structure formation in DNA data storage by combining Varshamov-Tenengolts (VT) codes with kernel codes derived from group homomorphisms. It imposes GC-content and reverse-complement (RC) constraints to promote strand stability and prevent unwanted hybridization, and gives a construction that produces DNA codes of arbitrary length $n$ with RC-distance $d_{RC} = 2\left\lfloor\frac{n-3}{2}\right\rfloor$. The pipeline encodes information with a VT code, maps the result into a kernel code of length $n+1$, and applies a homomorphism before the final mapping to DNA bases, which provides single-indel correction together with RC and GC compliance. The result is a scalable, algebraic framework for stable, error-resilient DNA storage with GC content in a practical range (approximately 40–60%).
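The two constraints named above can be checked mechanically on any candidate code. The following is a minimal sketch, not the paper's construction: the function names are hypothetical, and the RC-distance is assumed to follow the common definition, the minimum Hamming distance between a codeword and the reverse complement of any codeword (including itself).

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def gc_content(strand: str) -> float:
    """Fraction of bases in the strand that are G or C."""
    return sum(base in "GC" for base in strand) / len(strand)


def reverse_complement(strand: str) -> str:
    """Reverse the strand, then complement every base."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strands differ."""
    if len(x) != len(y):
        raise ValueError("strands must have equal length")
    return sum(a != b for a, b in zip(x, y))


def rc_distance(code: list[str]) -> int:
    """Assumed definition: min over all x, y in the code (x == y allowed)
    of the Hamming distance between x and the reverse complement of y."""
    return min(hamming(x, reverse_complement(y)) for x in code for y in code)
```

For instance, `gc_content("ACGT")` is 0.5, which falls inside the 40–60% window, while a code that contains both a strand and its reverse complement has RC-distance 0 under this definition.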
Abstract
DNA is a promising storage medium, but its limited stability and its susceptibility to indel errors pose significant challenges. The relative proportion of guanine (G) and cytosine (C) in a DNA strand is crucial for its longevity, and reverse-complementary base pairs should be avoided to prevent the strand from forming secondary structures. We address both challenges by selecting appropriate group homomorphisms. To store and retrieve information in DNA strings, we use kernel codes together with the Varshamov-Tenengolts algorithm, which corrects single indel errors. Additionally, we construct codes of any desired length $n$ and compute their reverse-complement distance as a function of $n$.
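To illustrate how a Varshamov-Tenengolts code corrects a single deletion, here is a minimal sketch of the classical binary VT code $VT_a(n) = \{x \in \{0,1\}^n : \sum_{i=1}^{n} i\,x_i \equiv a \pmod{n+1}\}$ with Levenshtein's deletion decoder. It is an assumption that the binary form is the relevant one here; the paper's actual alphabet, its kernel-code mapping, and its handling of insertions are not reproduced, and the function names are hypothetical.

```python
def vt_syndrome(x: tuple[int, ...]) -> int:
    """Weighted checksum sum(i * x_i) mod (n + 1), positions counted from 1."""
    return sum(i * bit for i, bit in enumerate(x, start=1)) % (len(x) + 1)


def vt_decode_deletion(y: tuple[int, ...], n: int, a: int = 0) -> tuple[int, ...]:
    """Recover the codeword of VT_a(n) from y, a copy with one bit deleted."""
    if len(y) != n - 1:
        raise ValueError("y must be exactly one bit shorter than the codeword")
    weight = sum(y)
    checksum = sum(i * bit for i, bit in enumerate(y, start=1))
    deficiency = (a - checksum) % (n + 1)

    if deficiency <= weight:
        # A 0 was deleted: reinsert it so that exactly `deficiency`
        # ones lie to its right.
        pos, ones = len(y), 0
        while ones < deficiency:
            pos -= 1
            ones += y[pos]
        return y[:pos] + (0,) + y[pos:]

    # A 1 was deleted: reinsert it so that exactly
    # `deficiency - weight - 1` zeros lie to its left.
    zeros_needed = deficiency - weight - 1
    pos, zeros = 0, 0
    while zeros < zeros_needed:
        zeros += 1 - y[pos]
        pos += 1
    return y[:pos] + (1,) + y[pos:]
```

The decoder relies on the fact that deleting a bit lowers the checksum by the number of ones to its right (if a 0 was deleted) or by its position plus that count (if a 1 was deleted), so the deficiency alone determines both the deleted value and where to put it back.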
