Distributions of positive signals in pyrosequencing

Yong Kong

TL;DR

This work derives exact distributions for the number of positive signals $r$ in pyrosequencing pyrograms, under two models: the fixed r-seq length model (FRLM) and the fixed flow cycle model (FFCM). Both are treated with probability generating functions. By solving the recurrences and obtaining closed-form generating functions, the author obtains explicit means and variances for $r$ and for the number of flow cycles $f$, shows Gaussian limiting behavior, and finds a robust approximate relation $\bar{r}(f) \approx 2f$ that holds independent of the nucleotide probabilities. Simulations agree with the theory, which supports using the models to predict pyrogram signal distributions, with implications for base-calling thresholds and software design. The work also clarifies how the distributions of $f$, $n$, and $r$ relate, including transitive intuitions and the effect of equal versus unequal nucleotide probabilities on the variance.

Abstract

Pyrosequencing is one of the important next-generation sequencing technologies. We derive the distribution of the number of positive signals in pyrograms of this sequencing technology as a function of flow cycle numbers and nucleotide probabilities of the target sequences. As for the distribution of sequence length, we also derive the distribution of positive signals for the fixed flow cycle model. Explicit formulas are derived for the mean and variance of the distributions. A simple result for the mean of the distribution is that the mean number of positive signals in a pyrogram is approximately twice the number of flow cycles, regardless of nucleotide probabilities. The statistical distributions will be useful for instrument and software development for pyrosequencing and other related platforms.

Paper Structure

This paper contains 12 sections, 37 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Distributions of positive signals when flow cycles $f=100$, for both equal and unequal nucleotide probabilities. The unequal nucleotide probabilities used here are $p_a=1/3\approx 0.333$, $p_b=1/11\approx 0.091$, $p_c = 100/231\approx 0.433$, and $p_d=1/7\approx 0.143$. The exact distributions are calculated from Eqs. \ref{E:prf} and \ref{E:G_sum}. The continuous curves are the normal distributions $N({\bar{r}} (f), \sigma_r^2 (f))$ with the same mean and variance as the exact distributions, where ${\bar{r}} (f)$ and $\sigma_r^2 (f)$ are calculated from Eqs. \ref{E:avg_fixed_f} and \ref{E:var_fixed_f}. The two normal distributions shown here are $N(199.167, 33.681)$ and $N(199.130, 26.316)$, for equal and unequal nucleotide probabilities, respectively.
  • Figure 2: Distributions of the number of flow cycles when the number of positive signals $r=200$, for both equal and unequal nucleotide probabilities. The unequal nucleotide probabilities used here are the same as in Figure \ref{F:fixed_f}. The exact distributions are calculated from Eqs. \ref{E:pfr} and \ref{E:G_sum}. The continuous curves are the normal distributions $N({\bar{f}} (r), \sigma_f^2 (r))$ with the same mean and variance as the exact distributions, where ${\bar{f}} (r)$ and $\sigma_f^2 (r)$ are calculated from Eqs. \ref{E:f_r_avg} and \ref{E:f_r_var}. The two normal distributions shown here are $N(100.5, 8.448)$ and $N(100.5, 6.603)$, for equal and unequal nucleotide probabilities, respectively.
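
The setting of Figure 2 (flow cycles needed for a fixed number of positive signals) can be explored numerically with a small dynamic program. The sketch below is not the paper's generating-function method, and it rests on assumptions that this section does not spell out: nucleotides are flowed in a fixed order a, b, c, d; each positive signal corresponds to one base of the run-compressed sequence, so consecutive signal bases differ; the first base is drawn with the given probabilities and each later base with those probabilities renormalized to exclude the current base; and a new flow cycle starts whenever the next base comes earlier in the flow order than the current one. The names `flow_cycle_distribution` and `mean_and_variance` are illustrative only.

```python
def flow_cycle_distribution(r, probs):
    """Return {f: P(f | r)} for r positive signals under the assumptions above.

    probs[i] is the probability of the i-th nucleotide in flow order.
    """
    k = len(probs)
    # state[i][f] = P(last signal base is i and f flow cycles have been used)
    state = [{1: p} for p in probs]  # the first signal always lies in cycle 1
    for _ in range(r - 1):
        nxt = [dict() for _ in range(k)]
        for i in range(k):
            for f, mass in state[i].items():
                for j in range(k):
                    if j == i:
                        continue  # consecutive signal bases must differ
                    step = mass * probs[j] / (1.0 - probs[i])
                    # a later base in flow order fits in the current cycle,
                    # an earlier one has to wait for the next cycle
                    f_new = f if j > i else f + 1
                    nxt[j][f_new] = nxt[j].get(f_new, 0.0) + step
        state = nxt
    dist = {}
    for per_base in state:
        for f, mass in per_base.items():
            dist[f] = dist.get(f, 0.0) + mass
    return dist


def mean_and_variance(dist):
    """Mean and variance of a discrete distribution given as {value: prob}."""
    mean = sum(f * p for f, p in dist.items())
    var = sum((f - mean) ** 2 * p for f, p in dist.items())
    return mean, var


if __name__ == "__main__":
    mean, var = mean_and_variance(flow_cycle_distribution(200, [0.25] * 4))
    print(round(mean, 3), round(var, 3))
```

With equal probabilities and $r=200$, the printed mean and variance should come out close to the $N(100.5, 8.448)$ quoted in the Figure 2 caption. The function also accepts unequal probabilities, but how closely that case tracks the caption's values depends on modelling details (for example, how the first base is drawn) that are not given in this section.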