Statistical distributions of pyrosequencing

Yong Kong

TL;DR

The paper addresses the variability of pyrosequencing readouts by deriving exact distributions for two quantities: the number of flow cycles $f$ needed to read a sequence of fixed length $n$, and the sequence length $n$ that can be read with a fixed number of flow cycles $f$. A bivariate probability generating function gives the compact, symmetric form $G(x,y)=\frac{xy}{H}\big[1 - s_2(1-x)y + s_3(1-x)^2 y^2 - s_4(1-x)^3 y^3\big]$ with $H=1 - y + s_2(1-x)y^2 - s_3(1-x)^2 y^3 + s_4(1-x)^3 y^4$, where $s_k$ is the $k$-th elementary symmetric polynomial of the nucleotide probabilities $p_a, p_b, p_c, p_d$. Exact means and variances follow from $G$: for fixed $n$, $\bar{f}(n)=s_2 n - s_2 + 1$ and $\sigma_f^2(n)=(s_2 - 3s_2^2 + 2s_3)n + (5s_2^2 - s_2 - 4s_3)$; for fixed $f$, $\bar{n}(f)\approx \frac{f}{s_2} + \frac{2s_3}{s_2^2} - 2$ and $\sigma_n^2(f)\approx \frac{s_2 - 3s_2^2 + 2s_3}{s_2^3}f$ (up to small correction terms). In the equal-probability case $p_a=p_b=p_c=p_d=\tfrac14$, the fixed-$n$ results simplify to $\bar{f}(n)=\frac{3}{8}n+\frac{5}{8}$ and $\sigma_f^2(n)=\frac{5}{64}n+\frac{5}{64}$, and the corresponding distribution of $f$ is well approximated by the normal distribution $N(\bar{f},\sigma_f^2)$. The paper also gives separate distributions for reads ending in each of the four nucleotides, in both the fixed-$n$ and fixed-$f$ regimes, and checks the formulas against exact high-precision calculations. The results are intended to guide pyrosequencing instrument design, software development, and performance monitoring across different nucleotide compositions.
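The closed-form moments above are easy to evaluate numerically. The following is a minimal sketch, not code from the paper: it assumes that $s_2$ and $s_3$ are the elementary symmetric polynomials of the four nucleotide probabilities, and the function names are hypothetical. The fixed-$f$ function implements only the leading-order expressions quoted above and omits the small correction terms.

```python
from fractions import Fraction
from itertools import combinations
from math import prod


def elementary_symmetric(probs, k):
    """k-th elementary symmetric polynomial s_k (assumed meaning of s_k)."""
    return sum(prod(combo) for combo in combinations(probs, k))


def fixed_length_moments(probs, n):
    """Mean and variance of the flow-cycle count f for a fixed length n."""
    s2 = elementary_symmetric(probs, 2)
    s3 = elementary_symmetric(probs, 3)
    mean = s2 * n - s2 + 1
    var = (s2 - 3 * s2**2 + 2 * s3) * n + (5 * s2**2 - s2 - 4 * s3)
    return mean, var


def fixed_cycle_moments_leading(probs, f):
    """Leading-order mean and variance of the length n for f flow cycles.

    The small correction terms mentioned in the summary are not included.
    """
    s2 = elementary_symmetric(probs, 2)
    s3 = elementary_symmetric(probs, 3)
    mean = f / s2 + 2 * s3 / s2**2 - 2
    var = (s2 - 3 * s2**2 + 2 * s3) / s2**3 * f
    return mean, var


# Exact rational arithmetic avoids rounding in the comparison below.
equal = [Fraction(1, 4)] * 4
unequal = [Fraction(1, 3), Fraction(1, 11), Fraction(100, 231), Fraction(1, 7)]

for label, probs in (("equal", equal), ("unequal", unequal)):
    mean_f, var_f = fixed_length_moments(probs, 250)
    mean_n, var_n = fixed_cycle_moments_leading(probs, 100)
    print(label, float(mean_f), float(var_f), float(mean_n), float(var_n))
```

With equal probabilities and $n=250$ this gives $\bar{f}=94.375$ and $\sigma_f^2=19.609375$, the parameters of the normal curve quoted in the Figure 1 caption.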

Abstract

Pyrosequencing is emerging as one of the important next-generation sequencing technologies. We derive the statistical distributions of this technique in terms of the nucleotide probabilities of the target sequences. We give exact distributions both for a fixed number of flow cycles and for a fixed sequence length. Explicit formulas are derived for the mean and variance of these distributions. In both cases, the distributions can be approximated accurately by normal distributions with the same mean and variance. The statistical distributions will be useful for instrument and software development for pyrosequencing platforms.

Paper Structure

This paper contains 15 sections, 20 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: The distributions of flow cycles for a fixed sequence length of $n=250$ base pairs, for both equal nucleotide probability (on the right) and unequal nucleotide probabilities (on the left). The unequal nucleotide probabilities used here are $p_a=1/3=0.3333$, $p_b=1/11=0.0909$, $p_c = 100/231=0.4329$, and $p_d=1/7=0.1429$. The exact distributions are calculated from Eq. (\ref{E:G_sum}). The continuous curves are the normal distributions $N({\bar{f}} (n), \sigma_f^2 (n))$ with the same mean and variance as those of the exact distributions, where ${\bar{f}} (n)$ and $\sigma_f^2 (n)$ are calculated from Eqs. (\ref{E:fixed_length_avg}) and (\ref{E:fixed_length_var}). The two normal distributions shown here are $N(94.375, 19.609375)$ and $N(84.765278, 21.121065)$, for equal and unequal nucleotide probabilities, respectively.
  • Figure 2: The distributions of sequence length in base pairs for a fixed number of flow cycles $f=100$, for both equal nucleotide probability (on the left) and unequal nucleotide probabilities (on the right). The unequal nucleotide probabilities used here are the same as in Figure 1. The exact distributions are calculated from Eq. (\ref{E:G_sum}). The continuous curves are the normal distributions $N({\bar{n}} (f), \sigma_n^2 (f))$ with the same mean and variance as those of the exact distributions, where ${\bar{n}} (f)$ and $\sigma_n^2 (f)$ are calculated from Eqs. (\ref{E:fixed_cycle_avg}) and (\ref{E:fixed_cycle_var}). The two normal distributions shown here are $N(265.5555556, 148.3950617)$ and $N(296.0312085, 221.46233357)$, for equal and unequal nucleotide probabilities, respectively.
  • Figure 3: The distributions of individual flow cycles for a fixed sequence length of $n=250$ base pairs. The nucleotide probabilities used here are the same as those in the unequal probability case in Figure 1. These exact distributions are calculated from Eqs. (\ref{E:G_a}), (\ref{E:G_b}), (\ref{E:G_c}), and (\ref{E:G_d}) in section \ref{S:exact}. The continuous curves are the normal distributions $N({\bar{f}_i} (n), \sigma_{f_i}^2 (n))$ with the same means ${\bar{f}_i} (n)$ and variances $\sigma_{f_i}^2 (n)$ as those of the exact distributions, which are calculated from Eqs. (\ref{E:fixed_length_ind_avg}) and (\ref{E:fixed_length_ind_var}) for $i=a,b,c,d$. The normal distributions are scaled by the normalization factors of Eq. (\ref{E:normal_x}).
  • Figure 4: The distributions of the sequence length that can be determined with $f=100$ flow cycles, with the last flow ending in each of the four nucleotides. The nucleotide probabilities used here are the same as those in the unequal probability case in Figure 1. These exact distributions are calculated from Eqs. (\ref{E:G_a}), (\ref{E:G_b}), (\ref{E:G_c}), and (\ref{E:G_d}) in section \ref{S:exact}. The continuous curves are the normal distributions $N({\bar{n}_i} (f), \sigma_{n_i}^2 (f))$ with the same means ${\bar{n}_i} (f)$ and variances $\sigma_{n_i}^2 (f)$ as those of the exact distributions, which are calculated from Eqs. (\ref{E:fixed_cycle_ind_avg}) and (\ref{E:fixed_cycle_ind_var}) for $i=a,b,c,d$. The normal distributions are scaled by the normalization factors of Eq. (\ref{E:normal_y}).
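The comparison in Figure 1 between the exact fixed-length distribution and its normal approximation can be reproduced in outline with a small dynamic program. This is a sketch under stated assumptions, not the paper's generating-function method: bases are taken to be independent draws from the nucleotide probabilities, the flow order is fixed as index 0, 1, 2, 3 (standing in for a, b, c, d), a homopolymer run is read in a single flow, and a new flow cycle begins whenever the next base comes earlier in the flow order than the previous one. All function names are hypothetical.

```python
from math import exp, pi, sqrt


def flow_cycle_distribution(probs, n):
    """Exact P(f) for a read of n bases under the assumed i.i.d. model.

    state[j][f] is the probability that the bases read so far end in
    nucleotide j and have used f flow cycles.
    """
    # The first base is always read within the first flow cycle.
    state = [{1: p} for p in probs]
    for _ in range(n - 1):
        new_state = [dict() for _ in probs]
        for last, counts in enumerate(state):
            for f, mass in counts.items():
                for nxt, p in enumerate(probs):
                    # A base earlier in the flow order than the previous
                    # one cannot be read until the next cycle.
                    f_next = f + 1 if nxt < last else f
                    bucket = new_state[nxt]
                    bucket[f_next] = bucket.get(f_next, 0.0) + mass * p
        state = new_state
    total = {}
    for counts in state:
        for f, mass in counts.items():
            total[f] = total.get(f, 0.0) + mass
    return total


def moments(dist):
    """Mean and variance of a discrete distribution given as {value: prob}."""
    mean = sum(f * p for f, p in dist.items())
    var = sum((f - mean) ** 2 * p for f, p in dist.items())
    return mean, var


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


dist = flow_cycle_distribution([0.25] * 4, 250)
mean, var = moments(dist)
# Largest pointwise gap between the exact probabilities and the normal curve.
max_gap = max(abs(p - normal_pdf(f, mean, var)) for f, p in dist.items())
print(mean, var, max_gap)
```

Under these assumptions the computed mean and variance for equal probabilities agree with the closed-form values $\bar{f}(250)=94.375$ and $\sigma_f^2(250)=19.609375$. Passing the unequal probabilities from Figure 1 instead of `[0.25] * 4` gives the second curve.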