Simulation-based inference of yeast centromeres
Eloïse Touron, Pedro L. C. Rodrigues, Julyan Arbel, Nelle Varoquaux, Michael Arbel
TL;DR
This work addresses the inference of centromere locations in yeast from Hi-C contact maps, a task for which existing methods depend on pre-localization and return only point estimates. It introduces a Bayesian, simulation-based framework in which a fast, simplified simulator generates synthetic contact maps, and which returns a posterior $p(\theta|C_{\text{ref}})$ over centromere locations rather than a single location. The authors develop two inference pipelines, SMC-ABC (metric-based) and SNPE-CNN (neural posterior estimation), and evaluate both on the yeast genome in a small setting (3 chromosomes) and a large setting (16 chromosomes). The method quantifies uncertainty, requires no initialization, and is scalable. Accuracy is strong in the small-genome case and only partially successful in the full-genome case, which points to further improvements such as transformer-based summaries for large-scale applications.
Abstract
Chromatin folding and the spatial arrangement of chromosomes in the cell play a crucial role in DNA replication and gene expression. Improper chromatin folding can lead to malfunctions and, over time, to disease. In eukaryotes, centromeres are essential for proper chromosome segregation and folding. Despite extensive research using de novo genome sequencing and annotation analysis, centromere locations in yeasts remain difficult to infer and are still unknown in most species. Recently, genome-wide chromosome conformation capture coupled with next-generation sequencing (Hi-C) has become one of the leading methods for investigating chromosome structure. Some recent studies have used Hi-C data to give a point estimate of each centromere, but those approaches rely heavily on a good pre-localization. Here, we present a novel approach that stochastically infers the locations of all centromeres in budding yeast from both the experimental Hi-C map and simulated contact maps.
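
To make the idea of inferring a posterior from simulated contact maps concrete, the following is a minimal sketch of plain rejection ABC, a simpler relative of SMC-ABC and not the authors' pipeline. Everything in it is an assumption made for illustration: the toy simulator `simulate_trans_map` (Poisson counts with a Gaussian bump of inter-chromosomal contacts centered on a pair of centromere bins), the uniform prior, the Euclidean distance between maps, and all parameter values.

```python
import numpy as np


def simulate_trans_map(theta, n_bins=(40, 50), width=3.0, signal=20.0,
                       background=1.0, rng=None):
    """Hypothetical toy simulator: a trans contact map between two chromosomes.

    theta holds the centromere bin on each chromosome. Counts are Poisson
    with a Gaussian enrichment centered at (theta[0], theta[1]).
    """
    rng = np.random.default_rng() if rng is None else rng
    i = np.arange(n_bins[0])[:, None]
    j = np.arange(n_bins[1])[None, :]
    bump = np.exp(-((i - theta[0]) ** 2 + (j - theta[1]) ** 2) / (2 * width ** 2))
    return rng.poisson(background + signal * bump)


def abc_rejection(c_ref, n_sims=5000, keep_frac=0.01, n_bins=(40, 50), seed=0):
    """Rejection ABC: keep the prior draws whose simulated map is closest to c_ref."""
    rng = np.random.default_rng(seed)
    # Uniform prior over bin positions on each chromosome (an assumption).
    thetas = np.column_stack([
        rng.uniform(0, n_bins[0] - 1, n_sims),
        rng.uniform(0, n_bins[1] - 1, n_sims),
    ])
    # Euclidean (Frobenius) distance between each simulated map and the reference.
    dists = np.array([
        np.linalg.norm(simulate_trans_map(t, n_bins, rng=rng) - c_ref)
        for t in thetas
    ])
    n_keep = max(1, int(round(keep_frac * n_sims)))
    keep = np.argsort(dists)[:n_keep]
    return thetas[keep]  # approximate samples from p(theta | c_ref)


# Synthetic "reference" map generated from known centromere bins.
theta_true = np.array([12.0, 33.0])
c_ref = simulate_trans_map(theta_true, rng=np.random.default_rng(42))

posterior = abc_rejection(c_ref)
print("posterior mean:", posterior.mean(axis=0))
print("posterior std: ", posterior.std(axis=0))
```

The spread of the retained samples gives a crude uncertainty estimate for each centromere position, which is the kind of output a point-estimate method does not provide. No initial guess of the centromere location is needed here, since candidates are drawn from the prior.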
