On noisy duplication channels with Markov sources

Brendon McBain, James Saunderson, Emanuele Viterbo

TL;DR

This work analyzes channels with noisy duplications, motivated by nanopore sequencing, and proves an asymptotic equipartition property (AEP) for the output process and the joint input-output process when the inputs are ergodic Markov sources. The AEP implies that the channel is information stable for such sources, so the Markov-constrained capacity equals $C_{\mathsf{Markov}} = \sup_{P \in \mathcal{P}} I(\mathbb{S}; \mathbb{Y}^{\mathbb{T}})$. The paper further relates the AEP for noisy duplications to hidden semi-Markov processes (HSMPs) via embedding arguments and the Shannon-McMillan-Breiman (SMB) theorem, establishing a bridge between random-length outputs and fixed-length analyses. It provides Monte Carlo-based lower bounds for the binary symmetric channel with Bernoulli and geometric duplications and discusses how these bounds connect to sticky-channel capacities. By linking randomly indexed entropy rates, semi-Markov process (SMP) embeddings, and HSMPs, the work lays a theoretical foundation for capacity estimation and coding strategies in nanopore-inspired DNA storage systems. It also identifies open challenges in constructing capacity-achieving Markov codes.

Abstract

Channels with noisy duplications have recently been used to model the nanopore sequencer. This paper extends some foundational information-theoretic results to this new scenario. We prove the asymptotic equipartition property (AEP) for noisy duplication processes based on ergodic Markov processes. A consequence is that the noisy duplication channel is information stable for ergodic Markov sources, and therefore the channel capacity constrained to Markov sources is the Markov-constrained Shannon capacity. We use the AEP to estimate lower bounds on the capacity of the binary symmetric channel with Bernoulli and geometric duplications using Monte Carlo simulations. In addition, we relate the AEP for noisy duplication processes to the AEP for hidden semi-Markov processes.
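
The AEP mentioned above can be illustrated in the simplest possible setting: a two-state ergodic Markov source with no channel at all. The Python sketch below is not taken from the paper. The transition probabilities `a` and `b`, the path length, and all function names are assumptions chosen for illustration. It compares the sample entropy $-\frac{1}{n}\log_2 P(X_1,\dots,X_n)$ of one long sampled path with the closed-form entropy rate of the chain; by the classical AEP the two should be close for large $n$.

```python
import math
import random

def stationary_two_state(a, b):
    """Stationary distribution of the chain with P(0->1)=a and P(1->0)=b."""
    return (b / (a + b), a / (a + b))

def binary_entropy(q):
    """Binary entropy in bits."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def entropy_rate(a, b):
    """Entropy rate (bits/symbol) of the stationary two-state chain."""
    pi0, pi1 = stationary_two_state(a, b)
    return pi0 * binary_entropy(a) + pi1 * binary_entropy(b)

def sample_entropy(a, b, n, rng):
    """-(1/n) log2 P(X_1..X_n) for one path started from stationarity."""
    pi = stationary_two_state(a, b)
    leave = (a, b)  # probability of leaving state 0 / state 1
    x = 0 if rng.random() < pi[0] else 1
    log_p = math.log2(pi[x])
    for _ in range(n - 1):
        if rng.random() < leave[x]:
            log_p += math.log2(leave[x])
            x = 1 - x
        else:
            log_p += math.log2(1 - leave[x])
    return -log_p / n

# Illustrative parameters (assumptions, not values from the paper).
rng = random.Random(0)
a, b = 0.2, 0.4
h = entropy_rate(a, b)
est = sample_entropy(a, b, 200_000, rng)
print(f"entropy rate = {h:.4f} bits, sample entropy = {est:.4f} bits")
```

The paper's contribution is the analogous statement for the random-length output of a noisy duplication channel, where the path probability is no longer a simple product of transition probabilities.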

Paper Structure

This paper contains 12 sections, 6 theorems, 15 equations, 2 figures.

Key Result

Lemma 1

The entropy rates $H(\mathbb{Y}^{\mathbb{T}})$ and $H(\mathbb{S},\mathbb{Y}^{\mathbb{T}})$ exist.

Figures (2)

  • Figure 1: Monte Carlo estimates of the information rate $I_{\text{BSCD,Ber}}(p,p_d)$ of the BSC with error probability $p$, Bernoulli duplications with probability $p_d$, and a $\mathsf{Ber}(1/2)$ source. The information rate $I_{\text{SC,Ber}}(p_d)$ is the case when $p=0$, which corresponds to a sticky channel with capacity $C_{\text{SC,Ber}}(p_d)$ and is computed numerically [Mitzenmacher2007].
  • Figure 2: Monte Carlo estimates of the information rate $I_{\text{BSCD,geom}}(p,p_d)$ of the BSC with error probability $p$, geometric duplications with probability $p_d$, and a $\mathsf{Ber}(1/2)$ source. The information rate $I_{\text{SC,geom}}(p_d)$ is the case when $p=0$, which corresponds to a sticky channel with capacity $C_{\text{SC,geom}}(p_d)$ and is computed numerically [Mitzenmacher2007].
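
The channels in the two figures can be simulated with a few lines of Python. The sketch below is an illustration under stated assumptions, not the authors' code. It assumes that every input bit is emitted once plus a random number of extra copies: one extra copy with probability $p_d$ in the Bernoulli case, or $K$ extra copies with $P(K=k) = (1-p_d)\,p_d^k$ in the geometric case. It also assumes that each emitted copy is flipped independently with probability $p$. The function name `noisy_duplication_channel` and its parameters are hypothetical. The sketch only generates channel outputs; estimating the information rate additionally requires evaluating output probabilities, which is not shown here.

```python
import random

def noisy_duplication_channel(bits, p, p_d, rng, dup="bernoulli"):
    """Pass `bits` through a BSC(p) with random duplications (assumed model)."""
    out = []
    for b in bits:
        copies = 1  # the symbol itself is always emitted
        if dup == "bernoulli":
            copies += rng.random() < p_d      # at most one extra copy
        elif dup == "geometric":
            while rng.random() < p_d:         # keep duplicating w.p. p_d
                copies += 1
        else:
            raise ValueError("dup must be 'bernoulli' or 'geometric'")
        for _ in range(copies):
            out.append(b ^ (rng.random() < p))  # independent bit flip per copy
    return out

def collapse_runs(seq):
    """Keep one symbol per run, e.g. [0,0,1,1,1,0] -> [0,1,0]."""
    return [s for i, s in enumerate(seq) if i == 0 or s != seq[i - 1]]

# Ber(1/2) source, as in the figures; the numeric values are illustrative.
rng = random.Random(1)
source = [rng.getrandbits(1) for _ in range(50_000)]
y_ber = noisy_duplication_channel(source, p=0.0, p_d=0.3, rng=rng)
y_geo = noisy_duplication_channel(source, p=0.0, p_d=0.5, rng=rng, dup="geometric")
y_noisy = noisy_duplication_channel(source, p=0.1, p_d=0.3, rng=rng)
print(len(source), len(y_ber), len(y_geo), len(y_noisy))
```

With $p=0$ the simulated channel only lengthens runs, so collapsing runs in the output recovers the run structure of the input; this is the sticky-channel special case referred to in the captions. The output length is random, which is why the paper works with randomly indexed entropy rates.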

Theorems & Definitions (14)

  • Lemma 1
  • proof
  • Theorem 1: Output AEP
  • Theorem 2: Joint AEP
  • proof
  • Definition 1: Embedded SMP [Johnson2014]
  • Lemma 2: Randomly indexed entropy rate
  • proof
  • Theorem 3: HSMP AEP
  • proof
  • ...and 4 more