Table of Contents
Fetching ...

Moonshine.jl: a Julia package for genome-scale model-based ancestral recombination graph inference

Patrick Fournier, Fabrice Larribe

Abstract

The ancestral recombination graph (ARG) is the model of choice in statistical genetics to model population ancestries. Software capable of simulating ARGs on a genome scale within a reasonable amount of time are now widely available for most practical use cases. While the inverse problem of inferring ancestries from a sample of haplotypes has seen major progress in the last decade, it does not enjoy the same level of advancement as its counterpart. Up until recently, even moderately sized samples could only be handled using heuristics. In recent years, the possibility of model-based inference for datasets closer to "real world" scenarios has become a reality, largely due to the development of threading-based samplers. This article introduces Moonshine.jl, a Julia package that has the ability, among other things, to infer ARGs for samples of thousands of human haplotypes of sizes on the order of hundreds of megabases within a reasonable amount of time. On recent hardware, our package is able to infer an ARG for samples of densely haplotyped (over one marker/kilobase) human chromosomes of sizes up to 10000 in well under a day on data simulated by msprime. Scaling up simulation on a compute cluster is straightforward thanks to a strictly single-threaded implementation. While model-based, it does not resort to threading but rather places restrictions on probability distributions typically used in simulation software in order to enforce sample consistency. In addition to being efficient, a strong emphasis is placed on ease of use and integration into the biostatistical software ecosystem.

Moonshine.jl: a Julia package for genome-scale model-based ancestral recombination graph inference

Abstract

The ancestral recombination graph (ARG) is the model of choice in statistical genetics to model population ancestries. Software capable of simulating ARGs on a genome scale within a reasonable amount of time are now widely available for most practical use cases. While the inverse problem of inferring ancestries from a sample of haplotypes has seen major progress in the last decade, it does not enjoy the same level of advancement as its counterpart. Up until recently, even moderately sized samples could only be handled using heuristics. In recent years, the possibility of model-based inference for datasets closer to "real world" scenarios has become a reality, largely due to the development of threading-based samplers. This article introduces Moonshine.jl, a Julia package that has the ability, among other things, to infer ARGs for samples of thousands of human haplotypes of sizes on the order of hundreds of megabases within a reasonable amount of time. On recent hardware, our package is able to infer an ARG for samples of densely haplotyped (over one marker/kilobase) human chromosomes of sizes up to 10000 in well under a day on data simulated by msprime. Scaling up simulation on a compute cluster is straightforward thanks to a strictly single-threaded implementation. While model-based, it does not resort to threading but rather places restrictions on probability distributions typically used in simulation software in order to enforce sample consistency. In addition to being efficient, a strong emphasis is placed on ease of use and integration into the biostatistical software ecosystem.

Paper Structure

This paper contains 24 sections, 1 theorem, 20 equations, 15 figures, 8 algorithms.

Key Result

Lemma 1

Let $K$ be the index of the sample returned by a run of the secretary sampler on a vector of size $m$ with threshold $t_0$, $t_m = \lfloor m t_0 \rfloor$, $p_i = p_{a b_i}$, and $K_{t_m} = \mathop{\mathrm{arg\,max}}\limits_{i = 1}^{t_m} \{ g_i + \log p_i \}$. Let $\widehat{K}$ be the index of the co

Figures (15)

  • Figure 1: Time and memory needed for the construction of a coalescent tree as a function of the number of haplotypes, haplotype length, and distance function. Sampling is exact ($t_0 = 1$) and no bias is applied ($c_0 = 1$). Construction with respect to either of the two distances yields identical memory usage since their computation does not involve memory allocation. Results are also available in \ref{['table:tree-build-time-benchmark']}.
  • Figure 2: Speedup in tree construction, probability of selecting the correct sequence and probability of early termination as a function of the sampling threshold $t_0$. No bias is applied ($c_0 = 1$). The probability of early termination is $1 - t_0$. As the number of candidate vertices increases, the probability of sampling correctly from the target distribution converges to $t_0 (1 - \log t_0)$ (depicted here). The exact probability for a single run is given by \ref{['lemma:secretary']}. Results are also available in \ref{['table:tree-sampling-threshold-benchmark']}.
  • Figure 3: Two types of recombination events. Blue and red boxes represent SNPs with the wild and derived alleles, respectively. Non-ancestral material is represented by yellow boxes. Mutation edges are marked with red dots and numbers indicating the mutating markers. Pink arrows indicate the breakpoints. When recoalescence occurs below the MRCA of the sample (gMRCA), the recoalescence vertex is "inserted" by deleting an edge from the graph and replacing it with two new edges incident to the recoalescence vertex; this is illustrated in \ref{['fig:recombination-below']}. By contrast, no deletion is required when recoalescence occurs above the gMRCA; as illustrated in \ref{['fig:recombination-above']}, the recoalescence vertex simply becomes the new gMRCA.
  • Figure 4: Types of recombination events. Pink arrows indicate the positions of breakpoints.
  • Figure 5: Schema of the MMN algorithm on an AVX-512 compliant architecture. After the ancestral edges for the interval of interest $\omega$ have been listed, the haplotypes of their parent and child vertices are divided into chunks of 8 markers, which are stored in two separate arrays: the first chunk in the "Parents" array corresponds to the parent of the first edge, the second chunk to the parent of the second edge, and so on. The "Children" array is constructed similarly. Each matching pair of cells from both arrays is then xored in a single operation, resulting in the "Xored Chunks" array. The bitwise conjunction between each chunk and the mask $m_{\omega}$ is then computed in a similar fashion. Note that the "Mask" array exists for illustrative purposes only and does not need to be instantiated in practice. The number of mutation edges is obtained by repeated bitshift and application of trailing_zeros on "Mutation Edges". The mutation edges can be obtained simultaneously via the indices of set bits.
  • ...and 10 more figures

Theorems & Definitions (2)

  • Lemma 1
  • proof