Coded Information Retrieval for Block-Structured DNA-Based Data Storage

Daniella Bar-Lev

Abstract

We study the problem of coded information retrieval for block-structured data, motivated by DNA-based storage systems where a database is partitioned into multiple files that must each be recoverable as an atomic unit. We initiate and formalize the block-structured retrieval problem, wherein $k$ information symbols are partitioned into two files $F_1$ and $F_2$ of sizes $s_1$ and $s_2 = k - s_1$. The objective is to characterize the set of achievable expected retrieval time pairs $\bigl(E_1(G), E_2(G)\bigr)$ over all $[n,k]$ linear codes with generator matrix $G$. We derive a family of linear lower bounds via mutual exclusivity of recovery sets, and develop a nonlinear geometric bound via column projection. For codes with no mixed columns, this yields the hyperbolic constraint $s_1/E_1 + s_2/E_2 \le 1$, which we conjecture to hold universally whenever $\max\{s_1,s_2\} \ge 2$. We analyze explicit codes, such as the identity code, file-dedicated MDS codes, and the systematic global MDS code, and compute their exact expected retrieval times. For file-dedicated codes we prove MDS optimality within the family and verify the hyperbolic constraint. For global MDS codes, we establish dominance by the proportional local MDS allocation via a combinatorial subset-counting argument, providing a significantly simpler proof compared to recent literature and formally extending the result to the asymmetric case. Finally, we characterize the limiting achievability region as $n \to \infty$: the hyperbolic boundary is asymptotically achieved by file-dedicated MDS codes, and is conjectured to be the exact boundary of the limiting achievability region.
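The claim that file-dedicated MDS codes satisfy the hyperbolic constraint can be checked numerically with a short sketch. It is not taken from the paper and rests on two assumptions: columns are sampled uniformly with replacement from all $n = n_1 + n_2$ columns, and file $F_i$ is recovered as soon as any $s_i$ distinct columns out of its $n_i$ dedicated columns have appeared (the MDS property). Under these assumptions the standard partial coupon-collector argument gives $E_i = \sum_{j=0}^{s_i-1} n/(n_i - j)$. The function names and the split $(n_1, n_2)$ are illustrative.

```python
from fractions import Fraction


def expected_draws_distinct(n, m, s):
    """Expected number of uniform draws (with replacement) from n columns
    until s distinct columns out of a designated group of m have appeared.

    While j group columns have been seen, a draw hits a new group column
    with probability (m - j) / n, so that stage takes n / (m - j) draws
    on average.
    """
    if not 0 <= s <= m <= n:
        raise ValueError("need 0 <= s <= m <= n")
    return sum((Fraction(n, m - j) for j in range(s)), Fraction(0))


def file_dedicated_mds(n1, n2, s1, s2):
    """(E1, E2) for an assumed file-dedicated layout: n1 columns carry an
    [n1, s1] MDS code for file 1, n2 columns an [n2, s2] MDS code for file 2."""
    n = n1 + n2
    return (expected_draws_distinct(n, n1, s1),
            expected_draws_distinct(n, n2, s2))


def hyperbolic_lhs(s1, s2, e1, e2):
    """Left-hand side s1/E1 + s2/E2 of the hyperbolic constraint."""
    return Fraction(s1) / e1 + Fraction(s2) / e2


if __name__ == "__main__":
    s1, s2 = 3, 5  # file sizes, k = 8
    for n in (20, 50, 200):
        n1 = n * s1 // (s1 + s2)  # roughly proportional allocation (assumption)
        e1, e2 = file_dedicated_mds(n1, n - n1, s1, s2)
        print(n, float(e1), float(e2), float(hyperbolic_lhs(s1, s2, e1, e2)))
```

Each term in the sum is at least $n/n_i$, so $s_i/E_i \le n_i/n$ and the left-hand side never exceeds $1$ in this model. The printed values approach $1$ as $n$ grows, which matches the statement that the hyperbolic boundary is approached asymptotically.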

Paper Structure

This paper contains 21 sections, 32 theorems, 109 equations, 4 figures.

Key Result

Lemma 1

For any generator matrix $G \in \mathbb{F}_q^{k \times n}$ and target index set $I \subseteq [k]$, the expected number of uniform draws with replacement to recover all coupons in $I$ is given by:
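Independently of any closed form, the expectation described in Lemma 1 can be computed by brute force for very small codes, which is useful as a sanity check. The sketch below is an illustration rather than the paper's method. It assumes $q = 2$, encodes each column of $G$ as an integer bitmask (bit $i$ is the entry in row $i$), and treats a target index $i \in I$ as recovered once the unit vector $e_i$ lies in the span of the columns drawn so far. The helper names are hypothetical, and the running time is exponential in $n$.

```python
from fractions import Fraction
from functools import lru_cache


def in_span_gf2(vectors, target):
    """True if bitmask `target` lies in the GF(2) span of the bitmask `vectors`."""
    basis = {}  # leading bit position -> basis vector with that leading bit
    for v in vectors:
        while v:
            h = v.bit_length() - 1
            if h not in basis:
                basis[h] = v
                break
            v ^= basis[h]  # cancel the leading bit and keep reducing
    while target:
        h = target.bit_length() - 1
        if h not in basis:
            return False
        target ^= basis[h]
    return True


def expected_retrieval_time(columns, targets):
    """Exact expected number of uniform draws with replacement from `columns`
    until every unit vector e_i, i in `targets`, is in the span of the
    distinct columns drawn so far (binary codes only)."""
    n = len(columns)
    units = [1 << i for i in targets]

    def recovers(mask):
        chosen = [columns[j] for j in range(n) if (mask >> j) & 1]
        return all(in_span_gf2(chosen, u) for u in units)

    if not recovers((1 << n) - 1):
        raise ValueError("targets are not recoverable from all columns")

    @lru_cache(maxsize=None)
    def e(mask):
        # mask = set of distinct columns seen so far
        if recovers(mask):
            return Fraction(0)
        seen = bin(mask).count("1")
        rest = sum((e(mask | (1 << j)) for j in range(n)
                    if not (mask >> j) & 1), Fraction(0))
        # E[S] = 1 + (|S|/n) E[S] + (1/n) * sum_{j not in S} E[S + j]
        return (n + rest) / (n - seen)

    return e(0)


if __name__ == "__main__":
    identity = [0b001, 0b010, 0b100]   # k = n = 3, uncoded
    parity = [0b01, 0b10, 0b11]        # k = 2, n = 3, columns e_0, e_1, e_0 + e_1
    print(expected_retrieval_time(identity, [0, 1, 2]))
    print(expected_retrieval_time(parity, [0]))
```

For the identity code the result reduces to the classical coupon-collector value $n H_{|I|}$, which gives a quick way to validate the recursion before trying coded examples.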

Figures (4)

  • Figure 1: Combinatorial cut bounds for $k=8$, $n=20$, across four partition configurations $(s_1,s_2)$. The solid line is the tightest cut $s^*=k-1$; dashed lines show weaker cuts. Black dots are local MDS operating points.
  • Figure 2: Same as Figure 1 but for $n=50$. The cuts tighten noticeably and the local MDS points approach the cut boundaries.
  • Figure 3: Nonlinear bounds for $k=8$, $n=20$, across four partition configurations $(s_1,s_2)$. Solid lines are proved bounds (Corollaries \ref{cor:smax_bound} and \ref{cor:cs_bound}); the dashed line is the conjectured hyperbolic boundary (Conjecture \ref{conj:hyperbola}). Black dots are local MDS operating points. The gray region is provably non-achievable.
  • Figure 4: Same as Figure 3 but for $n=50$. The local MDS points approach the conjectured hyperbolic boundary as $n$ grows, consistent with Theorem \ref{thm:asymptotic_ub}.

Theorems & Definitions (79)

  • Definition 1: Recovery Set \cite{gruica2024combinatorial}
  • Definition 2: Subset Count \cite{gruica2024combinatorial}
  • Lemma 1 \cite{gruica2024combinatorial}
  • Definition 3: File Retrieval Time
  • Definition 4: Column Counting Function
  • Lemma 2
  • proof
  • Lemma 3 \cite{abraham2024covering}
  • Example 1
  • Example 2
  • ...and 69 more