On noisy duplication channels with Markov sources

Brendon McBain; James Saunderson; Emanuele Viterbo

TL;DR

This work analyzes channels with noisy duplications, motivated by nanopore sequencing. It proves an asymptotic equipartition property (AEP) for the output process and for the joint input-output process when the input is an ergodic Markov source. As a consequence, the channel is information stable for such sources, and the Markov-constrained capacity is $C_{\mathsf{Markov}} = \sup_{P \in \mathcal{P}} I(\mathbb{S}; \mathbb{Y}^{\mathbb{T}})$. The paper also relates the AEP for noisy duplication processes to the AEP for hidden semi-Markov processes (HSMPs), using an embedded semi-Markov process (SMP) and the Shannon-McMillan-Breiman (SMB) theorem to connect random-length outputs with fixed-length analyses. It gives Monte Carlo lower bounds on the capacity of the binary symmetric channel (BSC) with Bernoulli and geometric duplications, and compares these bounds with sticky-channel capacities. Together, the results on randomly indexed entropy rates, SMP embeddings, and HSMPs support capacity estimation for nanopore-inspired DNA storage systems. Constructing capacity-achieving Markov codes remains an open problem.

Abstract

Channels with noisy duplications have recently been used to model the nanopore sequencer. This paper extends some foundational information-theoretic results to this new scenario. We prove the asymptotic equipartition property (AEP) for noisy duplication processes based on ergodic Markov processes. A consequence is that the noisy duplication channel is information stable for ergodic Markov sources, and therefore the channel capacity constrained to Markov sources is the Markov-constrained Shannon capacity. We use the AEP to estimate lower bounds on the capacity of the binary symmetric channel with Bernoulli and geometric duplications using Monte Carlo simulations. In addition, we relate the AEP for noisy duplication processes to the AEP for hidden semi-Markov processes.

Paper Structure (12 sections, 6 theorems, 15 equations, 2 figures)

Introduction
Noisy duplication channel
Entropy rates
Asymptotic equipartition property
Noisy duplication processes
Hidden semi-Markov processes
Markov-constrained channel capacity
BSC with Bernoulli duplications
BSC with geometric duplications
Conclusion
Preliminary lemmas
Proof of Theorem 1 (Output AEP)

Key Result

Lemma 1

The entropy rates $H(\mathbb{Y}^{\mathbb{T}})$ and $H(\mathbb{S},\mathbb{Y}^{\mathbb{T}})$ exist.
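
As a rough guide to what the lemma asserts, the rates can be sketched as below. This assumes a per-input-symbol normalization and writes $T_n$ for the (random) output length after $n$ input symbols; both conventions are assumptions made here for illustration, and the paper's own definitions take precedence.

$$
H(\mathbb{Y}^{\mathbb{T}}) = \lim_{n\to\infty} \frac{1}{n} H\big(Y^{T_n}\big),
\qquad
H(\mathbb{S},\mathbb{Y}^{\mathbb{T}}) = \lim_{n\to\infty} \frac{1}{n} H\big(S^{n}, Y^{T_n}\big).
$$

Under the same assumptions, the information rate appearing in the Markov-constrained capacity would decompose in the usual way, with $H(\mathbb{S})$ the entropy rate of the Markov source:

$$
I(\mathbb{S};\mathbb{Y}^{\mathbb{T}}) = H(\mathbb{S}) + H(\mathbb{Y}^{\mathbb{T}}) - H(\mathbb{S},\mathbb{Y}^{\mathbb{T}}).
$$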

Figures (2)

Figure 1: Monte Carlo estimates of the information rate $I_{\text{BSCD,Ber}}(p,p_d)$ of the BSC with error probability $p$ and Bernoulli duplications with probability $p_d$, driven by a $\mathsf{Ber}(1/2)$ source. The case $p=0$ gives the information rate $I_{\text{SC,Ber}}(p_d)$ of the corresponding sticky channel, whose capacity $C_{\text{SC,Ber}}(p_d)$ is computed numerically [Mitzenmacher2007].
Figure 2: Monte Carlo estimates of the information rate $I_{\text{BSCD,geom}}(p,p_d)$ of the BSC with error probability $p$ and geometric duplications with probability $p_d$, driven by a $\mathsf{Ber}(1/2)$ source. The case $p=0$ gives the information rate $I_{\text{SC,geom}}(p_d)$ of the corresponding sticky channel, whose capacity $C_{\text{SC,geom}}(p_d)$ is computed numerically [Mitzenmacher2007].
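
The kind of estimate plotted in Figure 1 can be illustrated with a minimal sketch. It assumes that a Bernoulli duplication emits each input bit once, or twice with probability $p_d$, and that every emitted copy is flipped independently with probability $p$. This reading of the channel, the forward recursion over output positions, and the function names `simulate`, `log2_likelihood`, and `estimate_information_rate` are assumptions made here for illustration, not the paper's implementation. The estimate is the single-sample quantity $\frac{1}{n}\left[\log_2 P(y \mid s) - \log_2 P(y)\right]$, which the AEP suggests should concentrate around the information rate for large $n$.

```python
import math
import random

def simulate(n, p, p_d, rng):
    """Draw n Ber(1/2) input bits and pass them through the assumed channel."""
    s, y = [], []
    for _ in range(n):
        bit = rng.randrange(2)
        s.append(bit)
        copies = 2 if rng.random() < p_d else 1   # assumed Bernoulli duplication
        for _ in range(copies):
            y.append(bit ^ (rng.random() < p))    # independent BSC flip per copy
    return s, y

def log2_likelihood(y, n, p, p_d, s=None):
    """log2 P(y) if s is None, else log2 P(y | s), for n inputs emitting exactly y."""
    T = len(y)

    def weight(k, ys):
        # Probability that input k emits the copies ys, averaged over its
        # value (Ber(1/2) prior) or conditioned on the known value s[k].
        bits = (0, 1) if s is None else (s[k],)
        prior = 0.5 if s is None else 1.0
        total = 0.0
        for b in bits:
            term = prior
            for v in ys:
                term *= p if v != b else 1.0 - p
            total += term
        return total

    alpha = [0.0] * (T + 1)   # alpha[t]: scaled prob. that inputs so far emitted y[:t]
    alpha[0] = 1.0
    log_scale = 0.0
    for k in range(n):
        new = [0.0] * (T + 1)
        for t in range(k + 1, min(2 * (k + 1), T) + 1):
            acc = alpha[t - 1] * (1.0 - p_d) * weight(k, y[t - 1:t])
            if t >= 2:
                acc += alpha[t - 2] * p_d * weight(k, y[t - 2:t])
            new[t] = acc
        norm = sum(new)
        if norm == 0.0:
            return float("-inf")
        alpha = [a / norm for a in new]   # rescale to avoid underflow
        log_scale += math.log2(norm)
    if alpha[T] == 0.0:
        return float("-inf")
    return log_scale + math.log2(alpha[T])

def estimate_information_rate(n, p, p_d, seed=0):
    """Single-sample estimate in bits per input symbol."""
    rng = random.Random(seed)
    s, y = simulate(n, p, p_d, rng)
    return (log2_likelihood(y, n, p, p_d, s) - log2_likelihood(y, n, p, p_d)) / n
```

Calling `estimate_information_rate(200, 0.0, 0.3)` gives a sticky-channel ($p=0$) estimate under these assumptions. Averaging over several seeds, or increasing `n`, would reduce the variance at a cost that grows roughly quadratically in `n` for this recursion.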

Theorems & Definitions (14)

Lemma 1
proof
Theorem 1: Output AEP
Theorem 2: Joint AEP
proof
Definition 1: Embedded SMP [Johnson2014]
Lemma 2: Randomly indexed entropy rate
proof
Theorem 3: HSMP AEP
proof
...and 4 more
