Simulation-based inference of yeast centromeres

Eloïse Touron, Pedro L. C. Rodrigues, Julyan Arbel, Nelle Varoquaux, Michael Arbel

TL;DR

This work tackles the problem of inferring centromere locations in yeasts from Hi-C maps, where traditional methods rely on pre-localization and yield point estimates. It introduces a Bayesian, simulation-based framework that uses a fast, simplified simulator to generate synthetic contact maps and yields a posterior $p(\theta|C_{\text{ref}})$ rather than a single location. The authors develop two inference pipelines, SMC-ABC and SNPE-CNN, to leverage either metric-based or neural-posterior approaches, respectively, and demonstrate their performance on the yeast genome in both a small (3 chromosomes) and a large (16 chromosomes) setting. The method quantifies uncertainty, does not require initialization, and is scalable, with strong accuracy in the small-genome case and partial success in the full-genome case, suggesting avenues for further improvement such as transformer-based summaries for large-scale applications.
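To make the metric-based idea concrete, here is a minimal sketch of plain rejection ABC, a simpler relative of the SMC-ABC pipeline named above. It is not the authors' implementation. The simulator `toy_simulator` is a hypothetical stand-in that places a Gaussian bump of inter-chromosomal contacts at the two centromere bins of a two-chromosome genome. The map size, noise level, prior, and number of simulations are all illustrative assumptions. The sketch keeps the 5% of sampled $\theta$ whose simulated maps correlate best (Pearson) with the reference map, then reports the mean Euclidean distance to the true $\theta$.

```python
import numpy as np


def toy_simulator(theta, n_bins=32, sigma=3.0, noise=0.05, rng=None):
    """Hypothetical toy simulator (NOT the paper's): trans-contact map between
    two chromosomes, with contacts enriched around the centromere bins theta."""
    rng = np.random.default_rng() if rng is None else rng
    i = np.arange(n_bins)[:, None]  # bins along chromosome 1
    j = np.arange(n_bins)[None, :]  # bins along chromosome 2
    signal = np.exp(-((i - theta[0]) ** 2 + (j - theta[1]) ** 2) / (2 * sigma**2))
    return signal + noise * rng.standard_normal((n_bins, n_bins))


def pearson(a, b):
    """Pearson correlation between two maps, treated as flat vectors."""
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]


def abc_rejection(c_ref, n_sims=2000, keep_frac=0.05, n_bins=32, rng=None):
    """Sample theta from a uniform prior, simulate, and keep the fraction
    of samples whose simulated map correlates best with c_ref."""
    rng = np.random.default_rng() if rng is None else rng
    thetas = rng.uniform(0, n_bins, size=(n_sims, 2))
    scores = np.array(
        [pearson(toy_simulator(t, n_bins, rng=rng), c_ref) for t in thetas]
    )
    n_keep = max(1, int(keep_frac * n_sims))
    best = np.argsort(scores)[-n_keep:]  # indices of the highest correlations
    return thetas[best]


rng = np.random.default_rng(0)
theta_ref = np.array([10.0, 20.0])         # assumed "true" centromere bins
c_ref = toy_simulator(theta_ref, rng=rng)  # synthetic reference map
accepted = abc_rejection(c_ref, rng=rng)   # approximate posterior samples

# Mean Euclidean distance between accepted samples and the reference
mean_err = np.linalg.norm(accepted - theta_ref, axis=1).mean()
print(f"kept {len(accepted)} samples, mean error = {mean_err:.2f} bins")
```

The accepted samples approximate the posterior $p(\theta|C_{\text{ref}})$ under this toy model, so their spread gives an uncertainty estimate rather than a single location. A sequential scheme such as SMC-ABC would refine the proposal over several rounds instead of sampling from the prior once.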

Abstract

Chromatin folding and the spatial arrangement of chromosomes in the cell play a crucial role in DNA replication and gene expression. Improper chromatin folding can lead to malfunctions and, over time, to disease. In eukaryotes, centromeres are essential for proper chromosome segregation and folding. Despite extensive research using de novo genome sequencing and annotation analysis, centromere locations in yeasts remain difficult to infer and are still unknown in most species. Recently, genome-wide chromosome conformation capture coupled with next-generation sequencing (Hi-C) has become one of the leading methods for investigating chromosome structure. Some recent studies have used Hi-C data to give a point estimate of each centromere, but those approaches rely heavily on a good pre-localization. Here, we present a novel approach that stochastically infers the locations of all centromeres in budding yeast based on both the experimental Hi-C map and simulated contact maps.

Paper Structure

This paper contains 21 sections, 4 equations, 7 figures, 4 algorithms.

Figures (7)

  • Figure 1: Inference using ABC-Pearson, ABC-CNN, and SBI-CNN (a). Color shades go from lightest to darkest across rounds. Densities are estimated from the best $5\%$ of $\theta$ according to the ABC criterion, or from samples drawn from the flow. The top-right panel (b) reports the mean Euclidean distance between $\theta$ and $\theta_\text{ref}$, computed over the best-performing $5\%$ of samples; its horizontal dashed line marks the resolution of the contact map $C_\text{ref}$ (in bp). Results with SBI-CNN are uniformly better, and both approaches based on data-driven summary statistics have errors smaller than the resolution of the contact maps.
  • Figure 2: Inference using ABC-Pearson, ABC-CNN, and SBI-CNN. Color shades go from lightest to darkest across rounds. Densities are estimated from the best $5\%$ of $\theta$ according to the ABC criterion, or from samples drawn from the flow. In some dimensions the densities are sharply peaked and centered around $\theta_i$ (e.g. chromosomes 4, 13, 15), but in others the inference is imprecise (e.g. chromosomes 1, 6, 10). Approaches based on data-driven summary statistics do not outperform the Pearson correlation-based method.
  • Figure 3: Process to construct a contact map in the case of $2$ chromosomes.
  • Figure 4: Hi-C map and our simulated map in the case of a small genome (resolution 32 kb).
  • Figure 5: Inference using ABC-Pearson, ABC-CNN, and SBI-CNN from synthetic data (a). Color shades go from lightest to darkest across rounds. Densities are estimated from the best $5\%$ of $\theta$ according to the ABC criterion, or from samples drawn from the flow. The top-right panel (b) reports the mean Euclidean distance between $\theta$ and $\theta_\text{ref}$, computed over the best-performing $5\%$ of samples; its horizontal dashed line marks the resolution of the contact map $C$ (in bp). Approaches based on data-driven summary statistics are uniformly better, although all approaches have errors smaller than the resolution of the contact maps.
  • ...and 2 more figures