A refinement of the Ewens sampling formula

Eugene Strahov

TL;DR

This work generalizes the Ewens sampling formula to an infinitely-many-alleles population model in which the alleles are partitioned into $k$ classes with class-specific mutation rates, yielding a refined sampling distribution for the joint class-structured allele counts. It presents two complementary derivations: a backward-in-time coalescent with killing, which leads to a generalized Hoppe urn representation, and a forward-in-time construction based on a multiple Poisson–Dirichlet distribution. Both give the same refined formula. The paper also develops the theory of Ewens multi-partition structures, relates them to wreath-product representations and a generalized Chinese restaurant process, and studies combinatorial objects such as random set partitions. Applications include exact formulas and limit theorems for the numbers of alleles in each class and a Poisson approximation for the sampling matrix, with implications for large-sample genetic data. Overall, the paper connects Poisson–Dirichlet distributions, the Chinese restaurant process, and wreath products to population-genetic sampling under multi-class mutation models.

Abstract

We consider an infinitely-many neutral allelic model of population genetics where all alleles are divided into a finite number of classes, and each class is characterized by its own mutation rate. For this model the allelic composition of a sample taken from a very large population of genes is characterized by a random matrix, and the problem is to describe the joint distribution of the matrix entries. The answer is given by a new generalization of the classical Ewens sampling formula called the refined Ewens sampling formula in the present paper. We discuss a Poisson approximation for the refined Ewens sampling formula, and present its derivation by several methods. As an application we obtain limit theorems for the numbers of alleles in different asymptotic regimes.
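
The random matrix described in the abstract can be illustrated with a small simulation. The sketch below is a minimal class-labelled Hoppe urn, written under an assumption that is not stated in the text above: when the sample already contains $m$ genes, the next gene starts a new allele of class $l$ with probability $\theta_l/(\theta_1+\ldots+\theta_k+m)$ and otherwise copies an existing allele with probability proportional to its current count. The function name `sample_class_matrix` and its return format are hypothetical choices made for illustration.

```python
import random
from collections import Counter

def sample_class_matrix(n, thetas, rng=None):
    """Run a class-labelled Hoppe urn for n draws (illustrative sketch).

    Returns a Counter mapping (class index l, count j) to the number of
    alleles of class l that are represented exactly j times in the sample.
    Class indices are 0-based.
    """
    rng = rng or random.Random()
    total_theta = sum(thetas)
    sizes = []    # sizes[i] = number of sampled genes carrying allele i
    classes = []  # classes[i] = class label of allele i
    for m in range(n):  # m genes are already in the sample
        u = rng.random() * (total_theta + m)
        if u < total_theta:
            # New allele: class l is chosen with weight thetas[l] (assumed rule).
            acc = 0.0
            for l, th in enumerate(thetas):
                acc += th
                if u < acc:
                    break
            sizes.append(1)
            classes.append(l)
        else:
            # Copy an existing allele, chosen proportionally to its count.
            u -= total_theta
            acc = 0.0
            for i, s in enumerate(sizes):
                acc += s
                if u < acc:
                    break
            sizes[i] += 1
    return Counter(zip(classes, sizes))

if __name__ == "__main__":
    matrix = sample_class_matrix(50, [0.5, 2.0], random.Random(1))
    for (l, j), count in sorted(matrix.items()):
        print(f"class {l + 1}: {count} allele(s) represented {j} time(s)")
```

Each run returns one realization of the sampling matrix; by construction, the counts weighted by $j$ always add up to the sample size $n$.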

Paper Structure (29 sections, 17 theorems, 199 equations, 1 figure, 1 table)

Key Result

Theorem 1.1

Assume that $\mu_1=\frac{\theta_1}{4N}$, $\ldots$, $\mu_k=\frac{\theta_k}{4N}$, where $\theta_1>0$, $\ldots$, $\theta_k>0$ are fixed numbers and $2N$ is the size of the population. As $N\longrightarrow\infty$, the random variable $A_{N,j}^{(l)}(n)$ converges in distribution to a random variable […] where $\{a_j^{(l)}:\;l=1,\ldots,k;\; j=1,\ldots,n \}$ are non-negative integers, and $(\theta)_n$ […]
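
The displayed formula of Theorem 1.1 is cut off in this excerpt, so the following is only a hedged numerical sketch of a candidate limiting distribution. It assumes, without confirmation from the truncated statement, the product form $\frac{n!}{(\theta_1+\ldots+\theta_k)_n}\prod_{l=1}^{k}\prod_{j=1}^{n}\left(\frac{\theta_l}{j}\right)^{a_j^{(l)}}\frac{1}{a_j^{(l)}!}$ on matrices with $\sum_{l}\sum_{j} j\,a_j^{(l)}=n$, where $(\theta)_n$ is taken to be the rising factorial $\theta(\theta+1)\cdots(\theta+n-1)$. For $k=1$ this reduces to the classical Ewens sampling formula. The function names are hypothetical.

```python
from math import factorial, prod

def rising_factorial(x, n):
    """Return x (x + 1) ... (x + n - 1); the empty product is 1."""
    return prod(x + i for i in range(n))

def refined_esf_pmf(a, thetas):
    """Evaluate the assumed product-form probability of a sampling matrix.

    a[l][j - 1] is the number of class-l alleles represented j times.
    The sample size n is read off from the matrix itself.
    """
    n = sum(j * c for row in a for j, c in enumerate(row, start=1))
    weight = 1.0
    for row, theta in zip(a, thetas):
        for j, c in enumerate(row, start=1):
            weight *= (theta / j) ** c / factorial(c)
    return factorial(n) / rising_factorial(sum(thetas), n) * weight

if __name__ == "__main__":
    # Two classes, sample of size 3: one class-1 singleton and one
    # class-2 allele represented twice.
    print(refined_esf_pmf([[1, 0, 0], [0, 1, 0]], [0.7, 2.0]))
```

A useful sanity check on the assumed form is that the values sum to 1 over all admissible matrices for a fixed $n$, which follows from expanding $(1-x)^{-(\theta_1+\ldots+\theta_k)}$ as a power series.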

Figures (1)

  • Figure 1: An analogue of the Chinese restaurant process.

Theorems & Definitions (42)

  • Theorem 1.1
  • Proposition 2.1
  • proof
  • Definition 2.2
  • Proposition 2.3
  • proof
  • Proposition 3.1
  • proof
  • Remark 3.2
  • Proposition 3.3
  • ...and 32 more