Statistical distributions of pyrosequencing
Yong Kong
TL;DR
The paper addresses the variability of pyrosequencing readouts by deriving exact distributions for two key quantities: the number of flow cycles $f$ needed to sequence a read of fixed length $n$, and the read length $n$ achievable with a fixed number of flow cycles $f$. It uses a bivariate probability generating function framework to obtain a compact, symmetric form $G(x,y)=\frac{xy}{H}\big[1 - s_2(1-x)y + s_3(1-x)^2 y^2 - s_4(1-x)^3 y^3\big]$ with $H=1 - y + s_2(1-x)y^2 - s_3(1-x)^2 y^3 + s_4(1-x)^3 y^4$, where $s_k$ is the $k$-th elementary symmetric polynomial in the nucleotide probabilities $p_a, p_b, p_c, p_d$. Exact means and variances follow from $G$: for fixed $n$, $\bar{f}(n)=s_2 n - s_2 + 1$ and $\sigma_f^2(n)=(s_2 - 3s_2^2 + 2s_3)n + (5s_2^2 - s_2 - 4s_3)$; for fixed $f$, $\bar{n}(f)\approx \frac{f}{s_2} + \frac{2s_3}{s_2^2} - 2$ and $\sigma_n^2(f)\approx \frac{s_2 - 3s_2^2 + 2s_3}{s_2^3}f$ (up to small corrections). In the equal-probability case $p_a=p_b=p_c=p_d=\tfrac14$, where $s_2=\tfrac38$ and $s_3=\tfrac1{16}$, the fixed-$n$ results simplify to $\bar{f}(n)=\frac{3}{8}n+\frac{5}{8}$ and $\sigma_f^2(n)=\frac{5}{64}n+\frac{5}{64}$, and the corresponding distribution of $f$ is well approximated by the normal distribution $N(\bar{f},\sigma_f^2)$. The results include detailed expressions for the distributions of sequences ending with each nucleotide, in both the fixed-$f$ and fixed-$n$ regimes, and are supported by exact calculations with high numerical precision. These findings provide practical guidance for pyrosequencing instrument design, software development, and performance monitoring across different nucleotide compositions.
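The closed-form moments above are easy to evaluate numerically for any nucleotide composition. The following is a minimal sketch rather than the author's code: the function names are hypothetical, and $s_k$ is assumed to be the $k$-th elementary symmetric polynomial of the four nucleotide probabilities, an assumption that reproduces the equal-probability values $s_2=\tfrac38$ and $s_3=\tfrac1{16}$ implied by the formulas quoted.

```python
from itertools import combinations
from math import prod


def elementary_symmetric(probs, k):
    """k-th elementary symmetric polynomial of the nucleotide probabilities.

    Assumption: this is what the summary's s_k denotes.
    """
    return sum(prod(combo) for combo in combinations(probs, k))


def flow_cycle_moments(probs, n):
    """Mean and variance of the number of flow cycles f for a fixed read length n."""
    s2 = elementary_symmetric(probs, 2)
    s3 = elementary_symmetric(probs, 3)
    mean = s2 * n - s2 + 1
    var = (s2 - 3 * s2**2 + 2 * s3) * n + (5 * s2**2 - s2 - 4 * s3)
    return mean, var


def read_length_moments_approx(probs, f):
    """Leading-order mean and variance of the read length n for fixed f.

    These are the approximate expressions from the summary; the small
    correction terms mentioned there are not included.
    """
    s2 = elementary_symmetric(probs, 2)
    s3 = elementary_symmetric(probs, 3)
    mean = f / s2 + 2 * s3 / s2**2 - 2
    var = (s2 - 3 * s2**2 + 2 * s3) / s2**3 * f
    return mean, var
```

For example, `flow_cycle_moments([0.25] * 4, 100)` evaluates $\tfrac38 n+\tfrac58$ and $\tfrac5{64}n+\tfrac5{64}$ at $n=100$, and passing a skewed composition such as `[0.4, 0.1, 0.1, 0.4]` shows how both moments shift with nucleotide content.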
Abstract
Pyrosequencing is emerging as one of the important next-generation sequencing technologies. We derive the statistical distributions of this technique in terms of the nucleotide probabilities of the target sequences. We give exact distributions both for a fixed number of flow cycles and for a fixed sequence length. Explicit formulas are derived for the mean and variance of these distributions. In both cases, the distributions can be approximated accurately by normal distributions with the same mean and variance. The statistical distributions will be useful for instrument and software development for pyrosequencing platforms.
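The fixed-length case can be sanity-checked with a small Monte Carlo experiment. The sketch below rests on a simplified model that is an assumption here, not a statement of the paper's derivation: bases are drawn independently, nucleotides are dispensed in a fixed order within each cycle, a homopolymer run is incorporated in a single flow, and a new cycle begins whenever the next base sits earlier in the flow order than the current one. All function names are hypothetical.

```python
import random
from statistics import fmean, pvariance


def count_flow_cycles(seq_idx):
    """Flow cycles needed to read a sequence of length >= 1.

    `seq_idx` holds each base's position in the flow order (0..3).
    A new cycle starts at every strict descent in that position.
    """
    cycles = 1
    for prev, cur in zip(seq_idx, seq_idx[1:]):
        if cur < prev:
            cycles += 1
    return cycles


def simulate_flow_cycles(probs, n, trials, seed=0):
    """Sample the number of flow cycles for `trials` random reads of length n."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    positions = range(len(probs))
    return [
        count_flow_cycles(rng.choices(positions, weights=probs, k=n))
        for _ in range(trials)
    ]


def sample_moments(samples):
    """Empirical mean and (population) variance of the simulated counts."""
    return fmean(samples), pvariance(samples)
```

Under this model, the output of `sample_moments(simulate_flow_cycles([0.25] * 4, 100, 20000))` can be compared with the closed-form mean and variance for the same $n$, and a histogram of the samples can be set against the normal density with those two parameters.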
