On Achievable Rates for the Shotgun Sequencing Channel with Erasures

Hrishi Narayanan; Prasad Krishnan; Nita Parekh

TL;DR

This work analyzes the Shotgun Sequencing Channel with Erasures (SSE($\delta$)), which models base-call erasures in reads through a per-symbol erasure probability $\delta$. It extends prior capacity results for the noiseless-read Shotgun Sequencing Channel by deriving an achievable-rate bound through a random code and a three-phase, typicality-based decoder that merges reads into islands. The main result is an explicit lower bound on the achievable rate: $R < (1- e^{-c(1-\delta)}) - (1-\delta)\left(e^{-c\left(1-\frac{1}{\bar{L}(1-\delta)}\right)} - e^{-c}\right)$, which reduces to the known capacity when $\delta=0$ and is supported by concentration lemmas for island formation and coverage. The findings show how quality-score erasures reduce the achievable rate and inform design considerations for DNA storage pipelines, while leaving open the development of tight converses and practical, efficient coding schemes for SSE.
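
As a quick check of the claim that the bound reduces to the noiseless-read capacity, the following worked step substitutes $\delta=0$ into the right-hand side of the bound above; no symbols beyond $c$ and $\bar{L}$ are introduced.

$$
\left(1- e^{-c}\right) - \left(e^{-c\left(1-\frac{1}{\bar{L}}\right)} - e^{-c}\right) = 1 - e^{-c\left(1-\frac{1}{\bar{L}}\right)}.
$$

The two $e^{-c}$ terms cancel, leaving an expression that depends only on the coverage depth $c$ and the normalized read length $\bar{L}$.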

Abstract

In shotgun sequencing, the input string (typically, a long DNA sequence composed of nucleotide bases) is sequenced as multiple overlapping fragments of much shorter lengths (called \textit{reads}). Modelling the shotgun sequencing pipeline as a communication channel for DNA data storage, the capacity of this channel was identified in a recent work, assuming that the reads themselves are noiseless substrings of the original sequence. Modern shotgun sequencers, however, also output quality scores for each base read, indicating the confidence in its identification. Bases with low quality scores can be treated as erased. Motivated by this, we consider the \textit{shotgun sequencing channel with erasures}, where each symbol in any read can be independently erased with some probability $\delta$. We identify achievable rates for this channel, using a random code construction and a decoder that uses typicality-like arguments to merge the reads.
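
The channel described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's formal model: it assumes a binary input of length $n$, reads of length $L = \bar{L}\log_2 n$, a read count $K = cn/L$, read start positions drawn uniformly with wrap-around, and `?` as the erasure symbol. The function name `sse_channel` and all of these conventions are illustrative choices.

```python
import math
import random

ERASURE = "?"  # assumed erasure symbol, chosen for illustration


def sse_channel(x, c, L_bar, delta, rng):
    """Sample reads from x and erase each symbol independently w.p. delta.

    Assumptions (not taken from the text): log base 2, uniform start
    positions, and reads that wrap around the end of x.
    """
    n = len(x)
    L = max(1, round(L_bar * math.log2(n)))  # read length ~ L_bar * log n
    K = max(1, round(c * n / L))             # number of reads, so c ~ K * L / n
    reads = []
    for _ in range(K):
        start = rng.randrange(n)
        clean = [x[(start + i) % n] for i in range(L)]  # noiseless read
        noisy = [ERASURE if rng.random() < delta else s for s in clean]
        reads.append("".join(noisy))
    return reads


rng = random.Random(0)
x = "".join(rng.choice("01") for _ in range(1024))
reads = sse_channel(x, c=2.0, L_bar=1.75, delta=0.2, rng=rng)
print(len(reads), len(reads[0]))
```

With `delta=0.0` the same function returns noiseless substrings of the (circular) input, which corresponds to the earlier noiseless-read setting.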

Paper Structure (13 sections, 7 theorems, 82 equations, 3 figures, 1 algorithm)

Introduction
Channel Description and Main Result
Achievability (Proof of Theorem 1)
Outline of the Coding Scheme
Merging and Coverage: Definitions and Terminology
Concentration Results and Bounds on Quantities
Decoding Algorithm
Brief overview of the proof of achievability
Detailed Proof of Achievability
Conclusion
Concentration inequalities used in this work
Proof of the bound for $\frac{1}{n} \log{|\mathsf{CI}|}$
Proof of the expression for $\lim_{d \to 0} \upbeta(d)$

Key Result

Theorem 1

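Let $c$ and $\bar{L}$ be the parameters of $\mathsf{SSE}(\delta)$ such that $c>0$ and $\bar{L}(1-\delta)>1$. Let $\alpha=c/(\bar{L}(1-\delta))$. The rate $R$ is achievable on $\mathsf{SSE}(\delta)$ if

$$R < \left(1- e^{-c(1-\delta)}\right) - (1-\delta)\left(e^{-c\left(1-\frac{1}{\bar{L}(1-\delta)}\right)} - e^{-c}\right).$$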
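
The bound of Theorem 1, in the form given in the TL;DR, can be evaluated numerically. The sketch below is illustrative only: the function names `sse_achievable_rate` and `ssc_capacity` are hypothetical, and the noiseless expression is simply the bound with $\delta=0$ substituted. It uses the identity $c\left(1-\frac{1}{\bar{L}(1-\delta)}\right) = c-\alpha$.

```python
import math


def sse_achievable_rate(c, L_bar, delta):
    """Right-hand side of the Theorem 1 bound; rates below it are achievable."""
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must lie in [0, 1)")
    if c <= 0 or L_bar * (1 - delta) <= 1:
        raise ValueError("Theorem 1 requires c > 0 and L_bar * (1 - delta) > 1")
    alpha = c / (L_bar * (1 - delta))
    # c * (1 - 1 / (L_bar * (1 - delta))) equals c - alpha
    return (1 - math.exp(-c * (1 - delta))) - (1 - delta) * (
        math.exp(-(c - alpha)) - math.exp(-c)
    )


def ssc_capacity(c, L_bar):
    """The same bound evaluated at delta = 0 (noiseless reads)."""
    return 1 - math.exp(-c * (1 - 1 / L_bar))


# Parameter values taken from the Figure 2 caption; c = 2.0 is an arbitrary choice.
for delta in (0.0, 0.05, 0.2, 0.3):
    print(delta, sse_achievable_rate(2.0, 1.75, delta))
```

The parameter check mirrors the theorem's hypotheses, so calls outside the stated regime fail loudly instead of returning a value the theorem does not cover.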

Figures (3)

Figure 1: The Shotgun Sequencing Channel with Erasures ($\mathsf{SSE}(\delta)$). The collection ${\cal \tilde{Y}} = \{\tilde{\underline{y}}_1, \tilde{\underline{y}}_2, \cdots, \tilde{\underline{y}}_{K} \}$ may be visualized as the output of the Shotgun Sequencing Channel [ravi_coded_ssc], and ${{\cal Y}} = \{\underline{y}_{1}, \underline{y}_{2}, \cdots, \underline{y}_{K} \}$ is the output of $\mathsf{SSE}(\delta)$, after bits in each read are erased (indicated in bold/red) with probability $\delta$.
Figure 2: The plot shows the rates from Theorem 1, with $\bar{L}=1.75$, as the coverage depth $c$ varies, for $\delta=0, 0.05, 0.2,$ and $0.3$. We compare these with the capacity of the shotgun sequencing channel (denoted by SSC) from [ravi_coded_ssc] with read lengths $\bar{L}(1-\delta)\log n$.
Figure 3: Illustration of a merge operation.
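
To give a concrete feel for what merging two erased reads could look like, here is a minimal sketch. The paper's formal definitions of compatibility and merging are not reproduced in this summary, so the rule used here is an assumption: two strings are compatible on an overlap of $l$ symbols if they agree wherever neither symbol is erased, and the merge fills each erased position from the other string when possible. The names `compatible`, `merge`, and the `?` erasure symbol are hypothetical.

```python
ERASURE = "?"  # assumed erasure symbol, chosen for illustration


def compatible(a, b):
    """Assumed rule: equal-length strings agree wherever neither is erased."""
    return len(a) == len(b) and all(
        s == t or ERASURE in (s, t) for s, t in zip(a, b)
    )


def merge(u, v, l):
    """Merge v onto the end of u with an overlap of l symbols.

    Returns None when the overlap is invalid or the strings conflict.
    """
    if not 0 < l <= min(len(u), len(v)):
        return None
    suffix, prefix = u[-l:], v[:l]
    if not compatible(suffix, prefix):
        return None
    # Prefer a known symbol; keep the erasure only if both are erased.
    overlap = "".join(t if s == ERASURE else s for s, t in zip(suffix, prefix))
    return u[:-l] + overlap + v[l:]


print(merge("0110?", "1?01", 3))
```

In this sketch a merge can resolve erasures, since a position erased in one read may be known in the other, which is one reason overlapping reads remain useful under erasures.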

Theorems & Definitions (24)

Theorem 1
Remark 1
Remark 2
Definition 1: Length and Size of string
Definition 2: Prefix and Suffix
Definition 3: Compatibility, $l$-compatible strings and substring compatibility
Definition 4: Merge of two strings
Definition 5: True Successors, Ordering, and Overlaps
Definition 6: Orderings, Islands, and True Islands
Definition 7
...and 14 more
