Learning Structurally Stabilized Representations for Multi-modal Lossless DNA Storage

Ben Cao, Tiantian He, Xue Li, Bin Wang, Xiaohu Wu, Qiang Zhang, Yew-Soon Ong

TL;DR

RSRL is an end-to-end framework that combines Reed-Solomon error correction with biologically inspired stability constraints to enable lossless, multi-modal DNA storage. Binary streams encoded with **RS(64,48)** are fed into a Transformer to learn compact representations, which are then transcoded to DNA via a specialized block mapping. A MASK-MSE loss, paired with GC-content and hairpin-minimization terms, enforces burst-error resilience and single-stranded stability, yielding higher information density and robust thermodynamic properties. In experiments, RSRL achieves lossless recovery with higher net information density, lower error rates, and faster encoding and decoding than strong baselines, which supports the practicality of integrating error correction and biomolecular constraints into neural representation learning for DNA storage.
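For orientation, the parameters implied by an RS(64,48) code can be worked out directly. The sketch below assumes the usual (n, k) notation, with n total symbols and k data symbols per codeword. The helper name `rs_params` is hypothetical and does not come from the paper. It only does the arithmetic and does not implement the codec itself.

```python
def rs_params(n: int, k: int) -> dict:
    """Basic parameters of an RS(n, k) code (n total symbols, k data symbols).

    Assumption: standard (n, k) notation; the paper's symbol size is not
    stated in this summary, so everything here is counted in symbols.
    """
    if not 0 < k < n:
        raise ValueError("need 0 < k < n")
    parity = n - k
    return {
        "parity_symbols": parity,
        # A Reed-Solomon code corrects up to floor((n - k) / 2) symbol errors
        # at unknown positions ...
        "correctable_symbol_errors": parity // 2,
        # ... or up to n - k erasures (errors at known positions).
        "correctable_erasures": parity,
        "code_rate": k / n,
    }


params = rs_params(64, 48)
print(params)
```

Under that reading, each 64-symbol codeword carries 48 data symbols and 16 parity symbols, for a code rate of 0.75 and up to 8 correctable symbol errors per codeword.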

Abstract

In this paper, we present Reed-Solomon coded single-stranded representation learning (RSRL), a novel end-to-end model for learning representations for multi-modal lossless DNA storage. In contrast to existing learning-based methods, RSRL is inspired by both error-correction coding and structural biology. Specifically, RSRL first learns representations for subsequent storage from binary data transformed by the Reed-Solomon codec. The representations are then masked by an RS-code-informed mask to focus on correcting the burst errors that occur in the learning process. Given the decoded, error-corrected representations, a novel biologically stabilized loss regularizes the data representations so that they possess stable single-stranded structures. With these strategies, RSRL learns highly durable, dense, and lossless representations for subsequent storage in DNA sequences. RSRL has been compared with a number of strong baselines on real-world multi-modal data storage tasks. The experimental results demonstrate that RSRL can store diverse types of data with much higher information density and durability and much lower error rates.
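To make the idea of a stable single-stranded sequence more concrete, the sketch below computes two simple sequence-level proxies: GC content and the presence of a hairpin-forming stem. It is an illustrative heuristic, not the paper's biologically stabilized loss. The function names, the stem and loop lengths, the 0.5 GC target, and the way the two terms are added are all assumptions made for this example.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence over the alphabet A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def has_hairpin_stem(seq: str, stem: int = 4, min_loop: int = 3) -> bool:
    """Heuristic hairpin check (assumed parameters, not from the paper).

    Returns True if some window of `stem` bases is followed, after at
    least `min_loop` bases, by its reverse complement, so the strand
    could fold back and pair with itself.
    """
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        target = reverse_complement(seq[i:i + stem])
        if seq.find(target, i + stem + min_loop) != -1:
            return True
    return False


def stability_penalty(seq: str, target_gc: float = 0.5) -> float:
    """Toy penalty: GC deviation from target plus 1.0 if a stem is found."""
    hairpin_term = 1.0 if has_hairpin_stem(seq) else 0.0
    return abs(gc_content(seq) - target_gc) + hairpin_term


print(stability_penalty("ACGT"))         # balanced GC, too short to fold
print(stability_penalty("GGGGAAACCCC"))  # GGGG can pair with CCCC
```

A differentiable training loss would need soft versions of these quantities computed on the learned representations. The hard checks above only show what the GC-content and hairpin terms are meant to discourage.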

Paper Structure

This paper contains 32 sections, 10 equations, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: Overview of the DNA storage scheme implemented by RSRL.
  • Figure 2: MASK-MSE loss maximizes the potential of RS error-correction codes.
  • Figure 3: The hairpin structure.
  • Figure 4: Comparison of Encoding Speed between RSRL and other baselines.
  • Figure 5: Comparison of mean and standard deviation of MFE between RSRL and other baselines.
  • ...and 3 more figures