Length distribution of sequencing by synthesis: fixed flow cycle model

Yong Kong

TL;DR

This work tackles the problem of quantifying read-length distributions in sequencing by synthesis (SBS) under probabilistic nucleotide incorporation. It introduces a fixed flow cycle model (FFCM) that fixes the number of flow cycles and yields the distribution of read length through generating-function (GF) techniques, complementing the prior fixed sequence length model (FSLM). The author derives exact GF expressions for the probability distributions and closed-form formulas for the mean and variance of read length as linear functions of the number of flow cycles, with explicit results for complete incorporation and extensions to incomplete incorporation. The approach is validated with simulations, and normal approximations are shown to fit well for practical cycle counts, illustrating the method’s potential for design, quality control, and analysis of SBS platforms across bulk and single-molecule contexts.

Abstract

Sequencing by synthesis is the underlying technology for many next-generation DNA sequencing platforms. We developed a new model, the fixed flow cycle model, to derive the distributions of sequence length for a given number of flow cycles under the general conditions where the nucleotide incorporation is probabilistic and may be incomplete, as in some single-molecule sequencing technologies. Unlike the previous model, the new model yields the probability distribution for the sequence length. Explicit closed form formulas are derived for the mean and variance of the distribution.
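
To illustrate what fixing the number of flow cycles means, the following minimal sketch simulates read length under complete incorporation. It is not taken from the paper: the flow order a, b, c, d, the independent draw of each template base, the uniform composition, and the names `simulate_read_length` and `UNIFORM` are all assumptions made for the example.

```python
import random
from statistics import mean, pvariance

# Assumed composition probabilities, listed in the assumed flow order a, b, c, d.
UNIFORM = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}


def simulate_read_length(f, probs, rng):
    """Return the number of bases read after f flow cycles (complete incorporation)."""
    bases = list(probs)                      # flow order within one cycle
    weights = [probs[b] for b in bases]
    length = 0
    # The template is i.i.d., so its next base can be drawn lazily when needed.
    next_base = rng.choices(bases, weights)[0]
    for _ in range(f):                       # f cycles ...
        for flowed in bases:                 # ... of one flow per nucleotide
            # Complete incorporation: a whole homopolymer run is read in one flow.
            while next_base == flowed:
                length += 1
                next_base = rng.choices(bases, weights)[0]
    return length


rng = random.Random(0)                       # fixed seed for repeatability
lengths = [simulate_read_length(50, UNIFORM, rng) for _ in range(2000)]
print("empirical mean:", mean(lengths))
print("empirical variance:", pvariance(lengths))
```

Repeating the simulation many times gives an empirical read-length distribution for a fixed $f$, which is the quantity the model describes analytically.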

Paper Structure

This paper contains 18 sections, 41 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The distribution of sequence length for the first $10$ cycles ($f=1,2, \dotsc, 10$) with complete nucleotide incorporation. The nucleotide composition probabilities used here are $p_a = 1/3 = 0.3333$, $p_b = 1/11 = 0.0909$, $p_c = 100/231 = 0.4329$, and $p_d = 1/7 = 0.1429$. The distribution is calculated from Eq. \ref{E:pyro_G}.
  • Figure 2: The distribution of sequence length with complete nucleotide incorporation for a fixed number of flow cycles, $f=100$. The nucleotide composition probabilities are the same as in Figure 1. The exact distribution, calculated from Eq. \ref{E:pyro_G}, is plotted as '+'. The continuous curve is the normal distribution $N({\bar{n}} (f), \sigma^2 (f))$ with the same mean and variance as the exact distribution, where ${\bar{n}} (f)$ and $\sigma^2 (f)$ are calculated from Eqs. \ref{E:pyro_avg} and \ref{E:pyro_var}. The normal distribution shown here is $N(296.6452, 221.4998)$.
  • Figure 3: The distribution of sequence length for the first $10$ cycles ($f=1,2, \dotsc, 10$) with incomplete nucleotide incorporation. The nucleotide composition probabilities used here are the same as in Figure 1. The hypothetical non-zero nucleotide incorporation probabilities are $\alpha_{j}^{(a)} = [6/55, 1/2, 3/10, 1/11]$, $\alpha_{j}^{(b)} = [19/60, 1/4, 1/3, 1/10]$, $\alpha_{j}^{(c)} = [407/630, 1/7, 1/10, 1/9]$, and $\alpha_{j}^{(d)} = [17/40, 1/5, 1/4, 1/8]$. The distribution is calculated from Eq. \ref{E:sbs_G}.
  • Figure 4: The distribution of sequence length with incomplete nucleotide incorporation for a fixed number of flow cycles, $f=100$. The nucleotide composition probabilities and the nucleotide incorporation probabilities are the same as in Figure 3. The exact distribution, calculated from Eq. \ref{E:sbs_G}, is plotted as '+'. The continuous curve is the normal distribution $N({\bar{n}} (f), \sigma^2 (f))$ with the same mean and variance as the exact distribution, where ${\bar{n}} (f)$ and $\sigma^2 (f)$ are calculated from Eqs. \ref{E:sbs_avg} and \ref{E:sbs_var}. The normal distribution shown here is $N(73.7228, 47.4722)$.
  • Figure 5: The distributions of sequence length for a fixed number of flow cycles, $f=100$, for complete and incomplete nucleotide incorporation. The curve on the right is for complete nucleotide incorporation; the curve on the left is for incomplete nucleotide incorporation. The nucleotide composition probabilities and the nucleotide incorporation probabilities are the same as in Figure 2 and Figure 4.
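
The exact distributions in Figures 1 and 2 come from the paper's generating function, which is not reproduced in this summary. As a rough cross-check under stated assumptions, the sketch below obtains the same kind of distribution by direct dynamic programming instead. It assumes i.i.d. template bases with the composition probabilities from the Figure 1 caption, the flow order a, b, c, d, and complete incorporation, so that a base needs a new cycle exactly when it is flowed earlier in the cycle than the base before it. The function names and the truncation point `n_max` are hypothetical.

```python
# Composition probabilities from the Figure 1 caption, in the assumed flow order a, b, c, d.
P = [1 / 3, 1 / 11, 100 / 231, 1 / 7]


def length_distribution(f, p, n_max):
    """Return [P(length = n) for n in 0..n_max-1] after f >= 1 cycles, complete incorporation."""
    k = len(p)
    # state[c][x]: probability that the current last base has type x and is
    # incorporated in cycle c (0-based); prefixes needing more than f cycles are dropped.
    state = [[0.0] * k for _ in range(f)]
    for x in range(k):
        state[0][x] = p[x]                   # the first base is always read in cycle 0
    reach = [1.0]                            # reach[n] = P(length >= n)
    for _ in range(n_max):
        reach.append(sum(map(sum, state)))
        new = [[0.0] * k for _ in range(f)]
        for c in range(f):
            for x in range(k):
                mass = state[c][x]
                if mass == 0.0:
                    continue
                for y in range(k):
                    # A base flowed earlier than its predecessor waits for the next cycle.
                    c2 = c if y >= x else c + 1
                    if c2 < f:
                        new[c2][y] += mass * p[y]
        state = new
    return [reach[n] - reach[n + 1] for n in range(n_max)]


def mean_and_variance(dist):
    """Mean and variance of a distribution given as a list indexed by length."""
    m = sum(n * q for n, q in enumerate(dist))
    v = sum((n - m) ** 2 * q for n, q in enumerate(dist))
    return m, v


for f in (1, 10, 100):
    m, v = mean_and_variance(length_distribution(f, P, 600))
    print(f, round(m, 4), round(v, 4))
```

For $f=100$ the printed mean and variance can be compared with the $N(296.6452, 221.4998)$ quoted in the Figure 2 caption. The truncation at `n_max = 600` is an arbitrary choice that leaves a negligible tail for the cycle counts used here.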