Coding Over Coupon Collector Channels for Combinatorial Motif-Based DNA Storage

Roman Sokolovskii; Parv Agarwal; Luis Alberto Croquevielle; Zijian Zhou; Thomas Heinis

TL;DR

This work addresses DNA data storage using combinatorial motif encoding, where data blocks are formed from k-subsets of an n-motif library and synthesis produces random motif attachments per cycle. It formalizes Coupon Collector–type channels, derives capacity for the interference-free CC(n,k,R) channel, and extends to erasure NBEC and interference CC(n,k,R,ρ) models, connecting capacity with read-write costs. A coding framework based on non-binary SC-LDPC codes over GF(q), plus symmetry-preserving masking and a library-splitting scheme to curb decoding complexity, demonstrates near-capacity performance in numerical results. The approach leverages all information at the channel output (partial symbol observations) to surpass previous full-symbol-decoding methods, offering a practical path toward high-density, robust DNA storage with realistic read/write trade-offs.

Abstract

Encoding information in combinations of pre-synthesised deoxyribonucleic acid (DNA) strands (referred to as motifs) is an interesting approach to DNA storage that could potentially circumvent the prohibitive costs of nucleotide-by-nucleotide DNA synthesis. Based on our analysis of an empirical data set from HelixWorks, we propose two channel models for this setup (with and without interference) and analyse their fundamental limits. We propose a coding scheme that approaches those limits by leveraging all information available at the output of the channel, in contrast to earlier schemes developed for a similar setup by Preuss et al. We highlight an important connection between channel capacity curves and the fundamental trade-off between synthesis (writing) and sequencing (reading), and offer a way to mitigate an exponential growth in decoding complexity with the size of the motif library.

Paper Structure (17 sections, 3 theorems, 26 equations, 8 figures)

Introduction
Motif-Based DNA Storage
General Principle
Experimental Data Set
Channel Simulator
Coupon Collector Channels
Capacity of the $\mathsf{CC}(n,k,R)$ Channel
Erasure Version of the Coupon Collector Channel
Coupon Collector Channel with Interference
Read-Write Cost Trade-Off
Non-Binary LDPC Codes Over the Coupon Collector Channels
No Interference
Interference
Protograph-Based Non-Binary SC-LDPC Codes
Combinatorial Explosion in Decoding Complexity
...and 2 more sections

Key Result

Theorem 1

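The capacity of $\mathsf{CC}(n,k,R)$ is given by

$$\log_2\binom{n}{k} - \frac{1}{k^R}\sum_{\ell=1}^{L} \ell!\,\binom{k}{\ell}\,\genfrac{\{}{\}}{0pt}{}{R}{\ell}\,\log_2\binom{n-\ell}{k-\ell},$$

where $L=\min\{k, R\}$, and $\genfrac{\{}{\}}{0pt}{}{R}{\ell} = \frac{1}{\ell!}\sum_{i=0}^{\ell} (-1)^{\ell-i} \binom{\ell}{i} i^R$ are the Stirling numbers of the second kind.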
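The quantities in Theorem 1 can be checked numerically. The sketch below assumes a uniform input over all $k$-subsets of the library and the sampling model of Figure 3 ($R$ reads drawn uniformly with replacement from the $k$ chosen motifs). Under those assumptions, exactly $\ell$ distinct motifs are seen with probability $\ell!\binom{k}{\ell}\genfrac{\{}{\}}{0pt}{}{R}{\ell}/k^R$, and such an observation leaves $\binom{n-\ell}{k-\ell}$ equally likely inputs, so the mutual information is $\log_2\binom{n}{k}$ minus the expected value of $\log_2\binom{n-\ell}{k-\ell}$. The function names are illustrative and do not come from the paper.

```python
from math import comb, factorial, log2


def stirling2(R, l):
    """Stirling number of the second kind via the explicit alternating sum."""
    total = sum((-1) ** (l - i) * comb(l, i) * i ** R for i in range(l + 1))
    return total // factorial(l)  # the sum is always an exact multiple of l!


def distinct_motif_pmf(k, R):
    """P(exactly l distinct motifs among R uniform reads), for l = 1..min(k, R)."""
    L = min(k, R)
    return {
        l: comb(k, l) * factorial(l) * stirling2(R, l) / k ** R
        for l in range(1, L + 1)
    }


def cc_capacity_bits(n, k, R):
    """Mutual information (bits per symbol) under a uniform input, R >= 1."""
    pmf = distinct_motif_pmf(k, R)
    # Seeing l distinct motifs leaves comb(n - l, k - l) equally likely inputs.
    residual = sum(p * log2(comb(n - l, k - l)) for l, p in pmf.items())
    return log2(comb(n, k)) - residual


if __name__ == "__main__":
    for R in (1, 4, 6, 12, 24):
        print(R, round(cc_capacity_bits(8, 4, R), 4))
```

For $(n=8, k=4)$ a single read yields about 1 bit, and the value approaches $\log_2 70 \approx 6.13$ bits as $R$ grows, which matches the asymptotic limit mentioned in the caption of Figure 4.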

Figures (8)

Figure 1: Combinatorial DNA data storage pipeline used in ref:Yan23. First, the data is encoded into blocks, each containing $2$ address-carrying symbols (denoted by $A0$ and $A1$) and $8$ payload-carrying symbols; a payload symbol (shown as a row of the data block) is a combination of $4$ distinct motifs chosen out of a library of $8$. Second, the DNA pool is synthesised: in each payload synthesis cycle, the $4$ chosen motifs corresponding to the synthesised symbol are added to the reaction tube containing a set of growing strands (shown vertically as columns); a random motif is attached to each strand. The bioinformatics pipeline at the reading stage includes sequencing, basecalling, and motif detection. It produces the original data blocks with some motifs missing or interfering due to synthesis and detection errors.
Figure 2: Empirical and simulated uncoded data recovery CDFs and motif histograms. The first two subplots correspond to block $21$. The first subplot shows a close match between the simulated (blue curve, circle marker) and empirical (red curve, square marker) CDFs. The representative interfered-motif histogram for symbol $8$ in this block (second subplot, red dashed bars) is relatively even. The last two subplots correspond to block $1$. A large discrepancy between the simulated and empirical CDFs (third subplot) is caused by an imbalanced interfered-motif histogram (last subplot, representative symbol $8$, red dashed bars).
Figure 3: An example of transmission over the Coupon Collector Channel $\mathsf{CC}(n=8,k=4,R=6)$. From a library of $n=8$ motifs, a combination of $k=4$ is chosen for transmission. The Coupon Collector Channel generates $R=6$ reads by uniform sampling with replacement among the $k$ chosen motifs. In the illustrated example, only $\ell=3$ motifs out of $k=4$ are encountered.
Figure 4: Capacities of $\mathsf{CC}$ (blue, circle marker) from \ref{eq:capacity_cc} and of $\mathsf{NBEC}$ (red, square marker) from \ref{eq:capacity_nbec} for $(n=8,k=4)$. The horizontal dashed line shows the asymptotic limit for both curves. The region of operation enabled by full processing of partial information is shaded green. The black dot shows the operating point of the coding scheme proposed in Section \ref{sec:ldpc}, see Section \ref{sec:results} for details. The code operates well above the capacity of $\mathsf{NBEC}$.
Figure 5: Top: capacity curves for the Coupon Collector Channel with $(n=8,k=4)$, no interference (black curve, no marker) and $\rho=0.078$ (blue curve, circle marker). For the hard-decision decoding scheme that waits for $t$ copies of the top $k$ motifs before making a decision, the capacity of the associated non-binary erasure channel (neglecting interference) is shown for $t=2$ (purple curve, square marker) and $t=3$ (red curve, triangle marker). Hard-decision schemes that mitigate interference by accumulating $t > 1$ copies of top-$k$ motifs cannot operate in the green-shaded region even when no interference is present. In contrast, the black dot shows the operating point of the coding scheme proposed in Section \ref{sec:ldpc} for $\rho=0.078$, see Section \ref{sec:results} for details. Soft information processing allows the scheme to operate in the green-shaded region. Bottom: symbol-level substitution rates for the same hard-decision schemes with $t = 2$ and $t = 3$ as in the top figure (same line styles) and $\rho=0.078$. These errors are not accounted for in the corresponding capacity curves, making the green-shaded region underestimate the potential gain from soft-decision information processing.
...and 3 more figures
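The transmission illustrated in Figure 3 can be reproduced with a few lines of simulation. The sketch below assumes the interference-free model only: reads are drawn uniformly with replacement from the chosen motifs, and motifs are labelled $0,\dots,n-1$. It also lists the inputs that remain consistent with a partial observation, which is the information a decoder can still use when fewer than $k$ motifs are seen. The names `cc_channel` and `consistent_inputs` are hypothetical helpers, not part of the paper.

```python
import random
from itertools import combinations


def cc_channel(chosen, R, rng):
    """Draw R reads uniformly with replacement from the chosen motifs."""
    pool = sorted(chosen)
    return [rng.choice(pool) for _ in range(R)]


def consistent_inputs(observed, n, k):
    """All k-subsets of {0, ..., n-1} that contain every observed motif."""
    rest = [m for m in range(n) if m not in observed]
    return [
        frozenset(observed) | frozenset(extra)
        for extra in combinations(rest, k - len(observed))
    ]


rng = random.Random(0)          # fixed seed, only for repeatability
chosen = {1, 3, 4, 6}           # an arbitrary combination of k = 4 out of n = 8
reads = cc_channel(chosen, R=6, rng=rng)
observed = set(reads)
candidates = consistent_inputs(observed, n=8, k=4)
print("reads:", reads)
print("distinct motifs seen:", len(observed))
print("inputs still possible:", len(candidates))
```

When only $\ell=3$ of the $k=4$ motifs are encountered, as in the Figure 3 example, $\binom{8-3}{4-3}=5$ candidate combinations remain, so the observation is informative even though the full symbol is not recovered.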

Theorems & Definitions (7)

Theorem 1
Lemma 2
proof
proof : Proof of Theorem \ref{thm:capacity-cc-channel-no-interference}
proof : Uniform input distribution
proof : Symmetry
Lemma 3: ref:gallager1968information
