
SCONE: A Practical, Constraint-Aware Plug-in for Latent Encoding in Learned DNA Storage

Cihan Ruan, Lebin Zhou, Rongduo Han, Linyi Han, Bingqing Zhao, Chenchen Zhu, Wei Jiang, Wei Wang, Nam Ling

TL;DR

This work addresses inefficient data encoding for DNA storage within neural compression pipelines. It introduces SCONE, a constraint-aware quaternary arithmetic coder that operates directly on latent representations. A deterministic finite-state machine (FSM) enforces GC balance and homopolymer suppression during encoding, which eliminates post-hoc correction while preserving differentiability with learned priors. The resulting interface is reproducible and suited to end-to-end use: it achieves about 1.86 bits per nucleotide with high constraint satisfaction (GC content ≈ $0.50$, maximum homopolymer length of 3) and negligible latency. In practice, SCONE enables end-to-end optimization of latent-to-DNA pipelines, offering a scalable, reversible solution for integrated DNA-based neural computation and storage.

Abstract

DNA storage has matured from concept to a practical stage, yet its integration with neural compression pipelines remains inefficient. Early DNA encoders applied redundancy-heavy constraint layers atop raw binary data, an approach that was workable but primitive. Recent neural codecs compress data into learned latent representations with rich statistical structure, yet still convert these latents to DNA via naive binary-to-quaternary transcoding, discarding the entropy model's optimization. This mismatch undermines compression efficiency and complicates the encoding stack. We present SCONE, a plug-in module that collapses latent compression and DNA encoding into a single step. SCONE performs quaternary arithmetic coding directly on the latent space, emitting DNA bases. Its Constraint-Aware Adaptive Coding module dynamically steers the entropy encoder's learned probability distribution to enforce two biochemical constraints, Guanine-Cytosine (GC) balance and homopolymer suppression, deterministically during encoding, eliminating post-hoc correction. The design preserves full reversibility and exploits the hyperprior model's learned priors without modification. Experiments show that SCONE achieves near-perfect constraint satisfaction with negligible computational overhead (<2% latency), establishing a latent-agnostic interface for end-to-end DNA-compatible learned codecs.
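
The idea of steering a learned distribution toward constraint-satisfying bases can be illustrated with a minimal sketch. It assumes one plausible reading of the abstract: at each coding step, bases that would violate a constraint receive zero probability and the remaining mass is renormalized. The function names `steer` and `cost_bits`, the example probabilities, and the masking rule itself are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Dict, Iterable


def steer(probs: Dict[str, float], allowed: Iterable[str]) -> Dict[str, float]:
    """Zero out disallowed bases and renormalize the remaining mass.

    Assumption: `probs` is a per-step distribution over the bases A, C, G, T
    supplied by an entropy model; `allowed` comes from a constraint tracker.
    """
    allowed_set = set(allowed)
    mass = sum(p for b, p in probs.items() if b in allowed_set)
    if mass <= 0.0:
        raise ValueError("no probability mass left on the allowed bases")
    return {b: (p / mass if b in allowed_set else 0.0) for b, p in probs.items()}


def cost_bits(probs: Dict[str, float], base: str) -> float:
    """Ideal arithmetic-coding cost, in bits, of emitting `base` under `probs`."""
    return -math.log2(probs[base])


# Hypothetical learned distribution for one step.
learned = {"A": 0.4, "C": 0.3, "G": 0.2, "T": 0.1}
# Suppose the previous three bases were "AAA", so another A is forbidden.
steered = steer(learned, allowed="CGT")
```

Because the allowed set in this sketch depends only on bases already emitted, a decoder can recompute the same steered distribution at every step, which is what keeps such a scheme reversible. The price is a shift in rate: in the example, coding `C` costs about 1 bit under the steered distribution instead of about 1.74 bits under the original, while `A` becomes unrepresentable at this step.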

Paper Structure

This paper contains 22 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: SCONE framework overview. (A) The SCONE module is designed as a plug-in neural encoder that integrates seamlessly into existing machine learning pipelines. It accepts latent representations $y$ from arbitrary upstream models (e.g., autoencoders, VAEs, image compressors) and transforms them into DNA-compatible representations. After biochemical storage and decoding, the reconstructed latent can be passed into downstream tasks such as lossless recovery of latent tokens or semantic retrieval, demonstrating SCONE’s adaptability to diverse latent spaces. (B) DW-Codec (DNA-Constrained & Wet-loop Codec): a codec stack that translates quantized representations into synthesizable DNA strands via base-4 arithmetic coding and constraint filtering. The decoder path reverses this transformation after sequencing. (C) FSM-regularized base-4 arithmetic encoding: quantized latent symbols are mapped to ATGC bases using a context model. Finite-State Machines (FSMs) filter illegal patterns (e.g., homopolymers or unbalanced GC content) before base-4 interval encoding.
  • Figure 2: Constraint satisfaction comparison. (a) GC content deviation from target 50%. (b) Maximum homopolymer length.
  • Figure 3: FSM-guided base selection. (a) No FSM: random quaternary sequence with GC=52% and max homopolymer=4 (red box indicates violation). (b) FSM-enabled: satisfies GC $\approx 50\%$ and HP $\leq 3$. (c) FSM state tracking: number of allowed bases per position (green=4 bases, orange=3, red=2).
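
The quantities named in Figures 2 and 3 (GC content, maximum homopolymer length, and the number of allowed bases per position) can be computed with a short sketch. The homopolymer limit of 3 and the 50% GC target come from the text above; the running-GC rule, its tolerance `GC_TOL`, the `WARMUP` length, and all function names are hypothetical choices made only for illustration and may differ from the FSM used in the paper.

```python
from typing import List, Set

BASES = "ACGT"
MAX_RUN = 3      # homopolymer limit reported in the source (HP <= 3)
GC_TARGET = 0.5  # GC target reported in the source
GC_TOL = 0.1     # hypothetical tolerance, not taken from the source
WARMUP = 8       # hypothetical: skip the GC rule on very short prefixes


def allowed_bases(prefix: str) -> Set[str]:
    """Bases that may follow `prefix` without breaking either constraint."""
    allowed = set(BASES)
    # Homopolymer rule: a run that already has MAX_RUN bases cannot grow.
    if len(prefix) >= MAX_RUN and len(set(prefix[-MAX_RUN:])) == 1:
        allowed.discard(prefix[-1])
    # GC rule (assumed form): push the running GC fraction back toward target.
    if len(prefix) >= WARMUP:
        gc = sum(b in "GC" for b in prefix) / len(prefix)
        if gc > GC_TARGET + GC_TOL:
            allowed -= {"G", "C"}
        elif gc < GC_TARGET - GC_TOL:
            allowed -= {"A", "T"}
    return allowed


def allowed_counts(seq: str) -> List[int]:
    """Number of allowed bases at each position, as in a Figure 3(c)-style trace."""
    return [len(allowed_bases(seq[:i])) for i in range(len(seq))]


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in `seq` (0.0 for an empty sequence)."""
    return sum(b in "GC" for b in seq) / len(seq) if seq else 0.0


def max_homopolymer(seq: str) -> int:
    """Length of the longest run of identical bases in `seq`."""
    best = run = 0
    prev = ""
    for b in seq:
        run = run + 1 if b == prev else 1
        best = max(best, run)
        prev = b
    return best
```

For instance, `allowed_bases("AAA")` excludes `A`, leaving three options, and a GC-heavy prefix such as `"GCGCGCGC"` leaves only `A` and `T`. Under these assumed rules the two filters can combine to leave a single base, but never an empty set, since the homopolymer rule removes at most one base and the GC rule removes at most one pair.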