On the Coverage Required for Diploid Genome Assembly

Daanish Mahajan; Chirag Jain; Navin Kashyap

TL;DR

The paper formalizes the diploid genome assembly problem from an information-theoretic perspective, deriving necessary conditions that any read set must satisfy to reconstruct the two haplotypes up to switch errors. It then analyzes three common assembly paradigms (greedy, de Bruijn graph, and overlap graph), establishing necessary and sufficient conditions under bridging and coverage constraints. The key finding is that all three approaches require double repeats to be bridged, so their read-length and coverage requirements are significantly higher than the haploid bounds (and higher than the basic $c_{LW}$ lower bound). An empirical evaluation on human chromosome 19 with simulated heterozygosity shows how these bounds manifest in practice and can guide future algorithmic and sequencing strategies for diploid genome assembly.

Abstract

The repeat content and heterozygosity rate of a target genome are important factors in determining the feasibility of achieving a complete telomere-to-telomere assembly. The mathematical relationship between the required coverage and read length for the purpose of unique reconstruction remains unexplored for diploid genomes. We investigate the information-theoretic conditions that the given set of sequencing reads must satisfy to achieve the complete reconstruction of the true sequence of a diploid genome. We also analyze the standard greedy and de-Bruijn graph-based assembly algorithms. Our results show that the coverage and read length requirements of the assembly algorithms are considerably higher than the lower bound because both algorithms require the double repeats in the genome to be bridged. Finally, we derive the necessary conditions for the overlap graph-based assembly paradigm.
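
To make the notion of a coverage lower bound concrete, the following is a minimal sketch of the classical Lander-Waterman coverage calculation for a single (haploid) sequence. It assumes reads of fixed length `read_len` sampled uniformly from a sequence of length `genome_len`, and solves $N = (G/L)\ln(N/\epsilon)$ for the number of reads $N$, so that the coverage is $c = NL/G$. The function name, the parameter names, and the example values are illustrative assumptions, and whether the paper's $c_{LW}$ is defined in exactly this way for diploid genomes is also an assumption.

```python
import math


def lander_waterman_coverage(genome_len, read_len, eps, max_iters=1000):
    """Solve N = (G / L) * ln(N / eps) by fixed-point iteration.

    Returns (N, c), where c = N * L / G is the coverage depth.
    Illustrative haploid formulation, not the paper's diploid bound.
    """
    if not 0 < eps < 1:
        raise ValueError("eps must lie in (0, 1)")
    ratio = genome_len / read_len
    if ratio <= 1:
        raise ValueError("genome_len must exceed read_len")

    n = ratio  # initial guess: one read per read-length window
    for _ in range(max_iters):
        n_next = ratio * math.log(n / eps)
        if abs(n_next - n) <= 1e-12 * n_next:
            n = n_next
            break
        n = n_next
    return n, n * read_len / genome_len


# Hypothetical example values: a 1 Mbp sequence, 10 kbp reads, 1% failure probability.
num_reads, coverage = lander_waterman_coverage(1_000_000, 10_000, 0.01)
print(f"reads needed: {num_reads:.0f}, coverage: {coverage:.2f}x")
```

The iteration is a contraction near the solution because the derivative of the right-hand side is $(G/L)/N = 1/\ln(N/\epsilon)$, which is below 1 whenever $N/\epsilon > e$. Covering every position is only a baseline requirement; the abstract's point is that the analyzed assembly algorithms need considerably more than this.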

Paper Structure (29 sections, 9 equations, 17 figures, 1 table, 3 algorithms)

Introduction
Preliminaries
Notations and Definitions
Problem Statement
Information theoretic necessary conditions
Necessary and sufficient conditions for different algorithms
Greedy algorithm
Algorithm
Necessary and sufficient conditions for correct reconstruction
Proof of the necessity of Conditions \ref{condition1_greedy} and \ref{condition2_greedy}
Proof for the sufficiency of Conditions \ref{condition0_greedy}--\ref{condition2_greedy}
De-Bruijn graph
Algorithm
Sufficient conditions for correct reconstruction
Proof of the sufficiency of Conditions \ref{conditions1_dbg} and \ref{conditions2_dbg}
...and 14 more sections

Figures (17)

Figure 1: Illustration of genome assembly with and without switch errors. The red and blue circles are used to represent nucleotides at heterozygous loci.
Figure 2: An example illustrating four reads sampled from a diploid genome $(\mathcal{H}_0, \mathcal{H}_1)$. $\mathcal{L}_i$ and $\mathcal{L}_{i+1}$ are heterozygous loci.
Figure 3: Reads $x_1^i$ and $x_2^i$ cover the heterozygous locus $\mathcal{L}_i$ and extend maximally towards $\mathcal{L}_{i+1}$. Similarly, reads $x_3^i$ and $x_4^i$ cover the heterozygous locus $\mathcal{L}_{i+1}$ and extend maximally towards $\mathcal{L}_{i}$. Substring $r''$ (shown in yellow) is double-bridged by reads $x_1^i$ and $x_4^i$. Double repeat $(r_1', r_2')$ (shown in purple) is well-bridged by reads $x_1^{i + 1}$ and $x_2^{i + 1}$.
Figure 4: A pair of inter-double repeats present in a specific orientation that can lead to an alternate reconstruction if Condition I2 is not satisfied. Here, $s_1$ and $s_2$ represent the substrings between the copies of the yellow and blue repeats in the two haplotypes, respectively. $s_1 \neq s_2$ because of the maximality of the double repeats.
Figure 5: Reads sampled from a diploid genome containing a single heterozygous locus.
...and 12 more figures

Theorems & Definitions (24)

Claim 1
proof
Claim 2
proof
Claim 3
proof
Claim 4
proof
Claim 5
proof
...and 14 more
