Learning Structurally Stabilized Representations for Multi-modal Lossless DNA Storage
Ben Cao, Tiantian He, Xue Li, Bin Wang, Xiaohu Wu, Qiang Zhang, Yew-Soon Ong
TL;DR
RSRL is an end-to-end framework that combines Reed-Solomon error correction with biologically inspired stability constraints to enable lossless, multi-modal DNA storage. The method feeds **RS(64,48)**-encoded binary streams into a Transformer to learn compact representations, which are then transcoded to DNA via a specialized block mapping. A MASK-MSE loss, paired with GC-content and hairpin-minimization terms, enforces burst-error resilience and single-stranded stability, yielding higher information density and robust thermodynamic properties. Experiments show that RSRL achieves lossless recovery with higher net information density and lower error rates, and that it encodes and decodes faster than strong baselines, underscoring the practicality of integrating error correction and biomolecular constraints into neural representation learning for DNA storage.
Abstract
In this paper, we present Reed-Solomon coded single-stranded representation learning (RSRL), a novel end-to-end model for learning representations for multi-modal lossless DNA storage. In contrast to existing learning-based methods, RSRL is inspired by both error-correction codecs and structural biology. Specifically, RSRL first learns representations for storage from binary data that has been transformed by the Reed-Solomon codec. The representations are then masked by an RS-code-informed mask so that learning focuses on correcting the burst errors that occur during the learning process. Given the decoded, error-corrected representations, a novel biologically stabilized loss regularizes the data representations so that they possess stable single-stranded structures. By incorporating these strategies, RSRL can learn highly durable, dense, and lossless representations for subsequent storage in DNA sequences. RSRL has been compared with a number of strong baselines on real-world multi-modal data storage tasks. The experimental results demonstrate that RSRL can store diverse types of data with much higher information density and durability and much lower error rates.
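To make two of the quantities named above concrete, the following is a minimal sketch and not the authors' implementation. It computes the standard parameters of an RS(64,48) code (the code size given in the TL;DR) and the GC content of a DNA string. The two-bits-per-nucleotide mapping in `bits_to_dna` is a deliberately naive stand-in, assumed only for illustration; it is not RSRL's specialized block mapping, and all function names here are hypothetical.

```python
def rs_parameters(n: int = 64, k: int = 48) -> dict:
    """Standard properties of an RS(n, k) code: n total symbols, k data symbols."""
    parity = n - k
    return {
        "parity_symbols": parity,
        # A Reed-Solomon code corrects up to floor((n - k) / 2) symbol errors.
        "correctable_symbol_errors": parity // 2,
        "code_rate": k / n,
    }


# Assumption: a naive fixed mapping, 00->A, 01->C, 10->G, 11->T.
# This is NOT the block mapping used by RSRL.
BASES = "ACGT"


def bits_to_dna(data: bytes) -> str:
    """Map each byte to four nucleotides, two bits per base, high bits first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bits(seq: str) -> bytes:
    """Inverse of bits_to_dna; expects a length that is a multiple of four."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    print(rs_parameters())
    strand = bits_to_dna(b"DNA")
    print(strand, gc_content(strand))
```

Under these assumptions, RS(64,48) carries 16 parity symbols per block, can correct up to 8 symbol errors, and has a code rate of 0.75. A GC-content term such as the one mentioned above would penalize strands whose `gc_content` drifts from a target range; the exact form of that penalty is not specified in this section.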
