SCONE: A Practical, Constraint-Aware Plug-in for Latent Encoding in Learned DNA Storage
Cihan Ruan, Lebin Zhou, Rongduo Han, Linyi Han, Bingqing Zhao, Chenchen Zhu, Wei Jiang, Wei Wang, Nam Ling
TL;DR
This work tackles the inefficiency of encoding data for DNA storage within neural compression pipelines by introducing SCONE, a constraint-aware quaternary arithmetic coder that operates directly on latent representations. It employs a deterministic finite-state machine (FSM) to enforce GC balance and homopolymer suppression during encoding, eliminating post-hoc corrections and preserving differentiability with learned priors. The approach yields a reproducible interface, suitable for end-to-end use, that achieves about 1.86 bits per nucleotide with high constraint satisfaction (GC content ≈ 0.50 and a maximum homopolymer length of 3) and negligible latency overhead. Practically, SCONE enables end-to-end optimization of latent-to-DNA pipelines, offering a scalable, reversible solution for integrated DNA-based neural computation and storage.
Abstract
DNA storage has matured from concept to practical stage, yet its integration with neural compression pipelines remains inefficient. Early DNA encoders applied redundancy-heavy constraint layers atop raw binary data, an approach that was workable but primitive. Recent neural codecs compress data into learned latent representations with rich statistical structure, yet still convert these latents to DNA via naive binary-to-quaternary transcoding, discarding the entropy model's optimization. This mismatch undermines compression efficiency and complicates the encoding stack. We propose SCONE, a plug-in module that collapses latent compression and DNA encoding into a single step. SCONE performs quaternary arithmetic coding directly on the latent representation, emitting DNA bases rather than bits. Its Constraint-Aware Adaptive Coding module dynamically steers the entropy encoder's learned probability distribution to enforce two biochemical constraints, Guanine-Cytosine (GC) balance and homopolymer suppression, deterministically during encoding, eliminating post-hoc correction. The design preserves full reversibility and exploits the hyperprior model's learned priors without modification. Experiments show SCONE achieves near-perfect constraint satisfaction with negligible computational overhead (<2% added latency), establishing a latent-agnostic interface for end-to-end DNA-compatible learned codecs.
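To make the idea of steering a probability distribution concrete, here is a minimal sketch of one way such a step could work. It is not the authors' implementation. It assumes that steering means zeroing the probability of any base that would break a constraint and renormalizing the rest before the arithmetic coder consumes the distribution. The function name `constrained_probs`, the sliding-window GC rule, and the parameters `max_run`, `gc_window`, `gc_min`, and `gc_max` are all hypothetical. The arithmetic coder itself is omitted.

```python
BASES = "ACGT"


def constrained_probs(probs, history, max_run=3, gc_window=20, gc_min=8, gc_max=12):
    """Return a 4-base distribution with constraint-violating bases masked out.

    probs   : dict mapping each base to its model probability (assumed input format)
    history : string of bases already emitted
    All thresholds are illustrative assumptions, not values from the paper.
    """
    # Length of the homopolymer run at the end of the history.
    run_base = history[-1] if history else None
    run_len = len(history) - len(history.rstrip(run_base)) if history else 0

    # GC count of the last (gc_window - 1) bases, once enough history exists.
    gc_tail = None
    if len(history) >= gc_window - 1:
        tail = history[len(history) - (gc_window - 1):]
        gc_tail = sum(c in "GC" for c in tail)

    allowed = {}
    for b in BASES:
        # Homopolymer rule: do not extend a run that is already at the limit.
        if b == run_base and run_len >= max_run:
            continue
        # GC rule: the window ending at this base must stay within [gc_min, gc_max].
        if gc_tail is not None:
            gc = gc_tail + (1 if b in "GC" else 0)
            if gc < gc_min or gc > gc_max:
                continue
        allowed[b] = probs[b]

    if not allowed:
        raise ValueError("no base satisfies the constraints for this history")

    total = sum(allowed.values())
    if total == 0.0:
        # The model gave zero mass to every legal base: fall back to uniform.
        return {b: (1.0 / len(allowed) if b in allowed else 0.0) for b in BASES}
    return {b: (allowed[b] / total if b in allowed else 0.0) for b in BASES}


if __name__ == "__main__":
    model = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
    # After "AAA" a fourth A is forbidden, so its mass moves to C, G, and T.
    print(constrained_probs(model, "GCAAA"))
```

Because the mask depends only on bases already emitted, a decoder can recompute the same masked distribution at every step, which is what keeps a scheme like this reversible. No bits are spent on forbidden symbols either, so the constraints cost rate only through the renormalization.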
