On the Coverage Required for Diploid Genome Assembly

Daanish Mahajan, Chirag Jain, Navin Kashyap

TL;DR

The paper formalizes diploid genome assembly from an information-theoretic perspective and derives necessary conditions that any read set must satisfy for the two haplotypes to be reconstructed up to switch errors. It then analyzes the greedy and de Bruijn graph assembly algorithms under bridging and coverage constraints, and derives necessary conditions for the overlap graph paradigm. The key finding is that these approaches require double repeats to be bridged, which pushes read-length and coverage requirements well above the haploid bounds and above the basic $c_{LW}$ lower bound. An empirical evaluation on human chromosome 19 with simulated heterozygosity shows how these bounds manifest in practice and can guide future algorithmic and sequencing strategies for diploid genome assembly.
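To make "up to switch errors" concrete, the following is a minimal sketch and is not taken from the paper. It assumes each haplotype is summarized as the string of alleles at the heterozygous loci, in genomic order; the function name `count_switch_errors` and this representation are illustrative assumptions. A switch error is counted whenever the predicted phase flips between two consecutive heterozygous loci.

```python
def count_switch_errors(true_h0: str, true_h1: str,
                        pred_h0: str, pred_h1: str) -> int:
    """Count phase flips between consecutive heterozygous loci.

    All four strings list the alleles at the same heterozygous loci,
    in order. Hypothetical helper for illustration only.
    """
    n = len(true_h0)
    if not (len(true_h1) == len(pred_h0) == len(pred_h1) == n):
        raise ValueError("all haplotype strings must have equal length")

    phases = []
    for i in range(n):
        if true_h0[i] == true_h1[i]:
            raise ValueError(f"locus {i} is not heterozygous")
        if pred_h0[i] == true_h0[i] and pred_h1[i] == true_h1[i]:
            phases.append(0)  # prediction agrees with the true labeling
        elif pred_h0[i] == true_h1[i] and pred_h1[i] == true_h0[i]:
            phases.append(1)  # prediction has the two haplotypes swapped
        else:
            # An allele mismatch is a base error, not a switch error.
            raise ValueError(f"allele mismatch at locus {i}")

    # A switch error occurs wherever the phase changes between neighbors.
    return sum(1 for a, b in zip(phases, phases[1:]) if a != b)


if __name__ == "__main__":
    # One phase flip between the second and third locus.
    print(count_switch_errors("AAAA", "CCCC", "AACC", "CCAA"))
```

Note that a prediction with the two haplotypes swapped everywhere has zero switch errors under this count, since the labeling of the two haplotypes is arbitrary.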

Abstract

The repeat content and heterozygosity rate of a target genome are important factors in determining the feasibility of achieving a complete telomere-to-telomere assembly. The mathematical relationship between the required coverage and read length for the purpose of unique reconstruction remains unexplored for diploid genomes. We investigate the information-theoretic conditions that the given set of sequencing reads must satisfy to achieve the complete reconstruction of the true sequence of a diploid genome. We also analyze the standard greedy and de-Bruijn graph-based assembly algorithms. Our results show that the coverage and read length requirements of the assembly algorithms are considerably higher than the lower bound because both algorithms require the double repeats in the genome to be bridged. Finally, we derive the necessary conditions for the overlap graph-based assembly paradigm.

Paper Structure (29 sections, 9 equations, 17 figures, 1 table, 3 algorithms)


Figures (17)

  • Figure 1: Illustration of genome assembly with and without switch errors. The red and blue circles are used to represent nucleotides at heterozygous loci.
  • Figure 2: An example illustrating four reads sampled from a diploid genome $(\mathcal{H}_0, \mathcal{H}_1)$. $\mathcal{L}_i$ and $\mathcal{L}_{i+1}$ are heterozygous loci.
  • Figure 3: Reads $x_1^i$ and $x_2^i$ cover the heterozygous locus $\mathcal{L}_i$ and extend the maximum towards $\mathcal{L}_{i+1}$. Similarly, reads $x_3^i$ and $x_4^i$ cover the heterozygous locus $\mathcal{L}_{i+1}$ and extend the maximum towards $\mathcal{L}_{i}$. Substring $r''$ (shown in yellow) is double-bridged by reads $x_1^i$ and $x_4^i$. Double repeat $(r_1', r_2')$ (shown in purple) is well-bridged by reads $x_1^{i + 1}$ and $x_2^{i + 1}$.
  • Figure 4: A pair of inter-double repeats in a specific orientation that can lead to an alternate reconstruction if Condition I2 is not satisfied. Here, $s_1$ and $s_2$ represent the substrings between the copies of the yellow and blue repeats in the two haplotypes, respectively. $s_1 \neq s_2$ because of the maximality of the double repeats.
  • Figure 5: Reads sampled from a diploid genome containing a single heterozygous locus.
  • ...and 12 more figures

Theorems & Definitions (24)

  • Claim 1
  • proof
  • Claim 2
  • proof
  • Claim 3
  • proof
  • Claim 4
  • proof
  • Claim 5
  • proof
  • ...and 14 more