Statistical distributions of sequencing by synthesis with probabilistic nucleotide incorporation

Yong Kong

TL;DR

Statistical distributions for sequencing by synthesis are derived that account for the possibility of incomplete nucleotide incorporation in each flow cycle. They are expressed in terms of the nucleotide probabilities of the target sequences and the incorporation probabilities for each nucleotide.

Abstract

Sequencing by synthesis is used in many next-generation DNA sequencing technologies. Some of the technologies, especially those exploring the principle of single-molecule sequencing, allow incomplete nucleotide incorporation in each cycle. We derive statistical distributions for sequencing by synthesis by taking into account the possibility that nucleotide incorporation may not be complete in each flow cycle. The statistical distributions are expressed in terms of nucleotide probabilities of the target sequences and the nucleotide incorporation probabilities for each nucleotide. We give exact distributions both for a fixed number of flow cycles and for a fixed sequence length. Explicit formulas are derived for the mean and variance of these distributions. The results are generalizations of our previous work for pyrosequencing. Incomplete nucleotide incorporation leads to significant changes in the mean and variance of the distributions, but the distributions can still be approximated by normal distributions with the same mean and variance. The results are also generalized to handle sequence-context-dependent incorporation. The statistical distributions will be useful for instrument and software development for sequencing-by-synthesis platforms.

Paper Structure

This paper contains 12 sections, 41 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: The distribution of flow cycles for a fixed sequence length of $n=100$ base pairs. The nucleotide composition probabilities used here are $p_a=3/10=0.3$, $p_b=1/5=0.2$, $p_c=1/5=0.2$, and $p_d=3/10=0.3$. The non-zero nucleotide incorporation probabilities are $\alpha_{aj} = [1/10, 1/5, 2/5, 3/10]$, $\alpha_{bj} = [3/10, 1/5, 1/10, 1/10, 1/10, 1/10, 1/10]$, $\alpha_{cj} = [3/10, 3/10, 3/10, 1/10]$, and $\alpha_{dj} = [2/5, 1/5, 1/5, 1/10, 1/10]$. The exact distribution is plotted as '+' and is calculated from Eq. \ref{E:G_sum}. The continuous curve is the normal distribution $N(\bar{f}(n), \sigma_f^2(n))$ with the same mean and variance as those of the exact distribution, where $\bar{f}(n)$ and $\sigma_f^2(n)$ are calculated from Eqs. \ref{E:fixed_length_avg} and \ref{E:fixed_length_var}. The normal distribution shown here is $N(201.63, 211.5197)$.
  • Figure 2: The distribution of sequence length in base pairs for a fixed number of flow cycles $f=50$. The nucleotide composition probabilities $p_i$ and nucleotide incorporation probabilities $\alpha_{ij}$ used here are the same as in Figure 1. The exact distribution is plotted as '+' and is calculated from Eq. \ref{E:G_sum}. The continuous curve is the normal distribution $N(\bar{n}(f), \sigma_n^2(f))$ with the same mean and variance as those of the exact distribution, where $\bar{n}(f)$ and $\sigma_n^2(f)$ are calculated from Eqs. \ref{E:fixed_cycle_avg} and \ref{E:fixed_cycle_var}. The normal distribution shown here is $N(25.0856, 13.1454)$.
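
The parameters quoted in the two captions can be checked with a few lines of Python. The sketch below is not the paper's method: it does not evaluate the exact distributions, whose formulas are not reproduced on this page. It only confirms that the listed $p_i$ and each listed $\alpha_{ij}$ vector sum to one, and evaluates the normal approximations $N(201.63, 211.5197)$ and $N(25.0856, 13.1454)$, reading the second argument as a variance, as the captions state. The names `is_distribution` and `normal_pdf` are illustrative helpers, not taken from the paper.

```python
import math
from fractions import Fraction as F

# Nucleotide composition probabilities p_i from the Figure 1 caption.
p = {"a": F(3, 10), "b": F(1, 5), "c": F(1, 5), "d": F(3, 10)}

# Non-zero incorporation probabilities alpha_ij from the Figure 1 caption.
alpha = {
    "a": [F(1, 10), F(1, 5), F(2, 5), F(3, 10)],
    "b": [F(3, 10), F(1, 5)] + [F(1, 10)] * 5,
    "c": [F(3, 10)] * 3 + [F(1, 10)],
    "d": [F(2, 5), F(1, 5), F(1, 5), F(1, 10), F(1, 10)],
}


def is_distribution(values):
    """True if all values are non-negative and sum exactly to one."""
    values = list(values)
    return all(v >= 0 for v in values) and sum(values) == 1


def normal_pdf(x, mean, var):
    """Density of N(mean, var); the second parameter is a variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


composition_ok = is_distribution(p.values())
alpha_ok = {i: is_distribution(row) for i, row in alpha.items()}

# Normal approximations quoted in the captions (mean, variance).
fig1 = (201.63, 211.5197)   # flow cycles for n = 100
fig2 = (25.0856, 13.1454)   # sequence length for f = 50

# Height of each approximating curve at its own mean.
peak_fig1 = normal_pdf(fig1[0], *fig1)
peak_fig2 = normal_pdf(fig2[0], *fig2)

if __name__ == "__main__":
    print("composition sums to 1:", composition_ok)
    print("alpha rows sum to 1:", alpha_ok)
    print("peak density, Figure 1 approximation:", peak_fig1)
    print("peak density, Figure 2 approximation:", peak_fig2)
```

Exact rational arithmetic via `fractions.Fraction` is used for the sum checks so that the test for "sums to one" is not affected by floating-point rounding.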