On the Coverage Required for Diploid Genome Assembly
Daanish Mahajan, Chirag Jain, Navin Kashyap
TL;DR
The paper formalizes the diploid genome assembly problem from an information-theoretic perspective, deriving necessary conditions that any read set must satisfy to reconstruct the two haplotypes up to switch errors. It then analyzes three common assembly paradigms (greedy, de Bruijn graph, and overlap graph), establishing necessary and sufficient conditions under bridging and coverage constraints. The key finding is that all three approaches require double repeats to be bridged, which pushes the read-length and coverage requirements well above the haploid bounds (and above the basic $c_{LW}$ lower bound). An empirical evaluation on human chromosome 19 with simulated heterozygosity shows how these bounds manifest in practice and can guide future algorithmic and sequencing strategies for diploid genome assembly.
Abstract
The repeat content and heterozygosity rate of a target genome are important factors in determining the feasibility of achieving a complete telomere-to-telomere assembly. The mathematical relationship between the required coverage and read length for the purpose of unique reconstruction remains unexplored for diploid genomes. We investigate the information-theoretic conditions that the given set of sequencing reads must satisfy to achieve the complete reconstruction of the true sequence of a diploid genome. We also analyze the standard greedy and de Bruijn graph-based assembly algorithms. Our results show that the coverage and read length requirements of the assembly algorithms are considerably higher than the lower bound because both algorithms require the double repeats in the genome to be bridged. Finally, we derive the necessary conditions for the overlap graph-based assembly paradigm.
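
To give a feel for why a bridging requirement raises coverage and read-length demands, the following is a minimal sketch, not the paper's derivation. It assumes an idealized model in which read start positions follow a Poisson process along the genome and all reads have the same length, and it treats a single repeat copy in isolation. The function names and parameters are hypothetical, and the paper's conditions for double repeats in diploid genomes are not reproduced here.

```python
import math


def coverage_depth(num_reads: int, read_len: int, genome_len: int) -> float:
    """Average coverage depth c = N * L / G."""
    return num_reads * read_len / genome_len


def prob_repeat_bridged(repeat_len: int, read_len: int, coverage: float) -> float:
    """Probability that at least one read bridges a given repeat copy.

    Assumption: read starts form a Poisson process with rate c / L per base.
    A read bridges the repeat if it contains the whole repeat plus at least
    one flanking base on each side, which leaves L - r - 1 valid start
    positions for a read of length L and a repeat of length r.
    """
    window = read_len - repeat_len - 1
    if window <= 0:
        # Reads no longer than the repeat plus two flanking bases cannot bridge it.
        return 0.0
    start_rate = coverage / read_len  # expected read starts per base
    return 1.0 - math.exp(-start_rate * window)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the trend.
    for read_len in (1_500, 5_000, 20_000):
        p = prob_repeat_bridged(repeat_len=1_000, read_len=read_len, coverage=10.0)
        print(f"L={read_len}: P(bridged) = {p:.4f}")
```

Under these assumptions, a repeat that is nearly as long as the reads is bridged only at high coverage, which is the intuition behind bridging conditions being stricter than a pure coverage bound.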
