Multi-megabase scale genome interpretation with genetic language models
Frederik Träuble, Lachlan Stuart, Andreas Georgiou, Pascal Notin, Arash Mehrjou, Ron Schwessinger, Mathieu Chevalley, Kim Branson, Bernhard Schölkopf, Cornelia van Duijn, Debora Marks, Patrick Schwab
TL;DR
Phenformer introduces a multi-scale genetic language model that interprets whole-genome sequences by linking DNA sequence to cell-context–specific expression and disease directly from sequence, processing up to $88$ million base pairs. It employs a frozen Enformer-based sequence-to-expression backbone to generate token embeddings, followed by a Transformer with Pooling by Multihead Attention to predict disease risk, trained on over $150{,}000$ UK Biobank genomes. The results show that Phenformer identifies disease-associated cell types from sequence with better literature alignment than baselines, improves disease risk prediction across ancestries when ensembled with PRS methods (e.g., AUROC gains up to $4.2$–$11.19\%$), and reveals molecular subtypes with distinct comorbidity patterns. These findings demonstrate the feasibility and value of end-to-end, multi-megabase genome interpretation for mechanistic insight and personalized risk prediction, while acknowledging limitations in genome coverage and the need for ethical deployment considerations.
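To make the pooling step above concrete, the following is a minimal NumPy sketch of the attention core of Pooling by Multihead Attention: a small set of seed vectors attends over a variable-length set of token embeddings and returns a fixed-size summary. It is an illustration under stated assumptions, not Phenformer's implementation. The function name `pma_pool`, the dimensions, and the random weights are all hypothetical, and the residual connection, layer normalisation and feed-forward layers of a full PMA block are omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pma_pool(tokens, seeds, w_q, w_k, w_v, num_heads):
    """Pool n token embeddings into k vectors via multihead attention.

    tokens: (n, d) embeddings, e.g. from a frozen backbone (assumption).
    seeds:  (k, d) seed vectors (learned in practice, random here).
    w_q, w_k, w_v: (d, d) projection matrices (hypothetical weights).
    Returns the pooled (k, d) array and the (heads, k, n) attention weights.
    """
    n, d = tokens.shape
    k = seeds.shape[0]
    head_dim = d // num_heads

    def split_heads(x, rows):
        # (rows, d) -> (heads, rows, head_dim)
        return x.reshape(rows, num_heads, head_dim).transpose(1, 0, 2)

    q = split_heads(seeds @ w_q, k)
    key = split_heads(tokens @ w_k, n)
    val = split_heads(tokens @ w_v, n)

    # Scaled dot-product attention of each seed over all tokens.
    scores = q @ key.transpose(0, 2, 1) / np.sqrt(head_dim)
    attn = softmax(scores, axis=-1)
    pooled = attn @ val  # (heads, k, head_dim)

    # Concatenate the heads back into (k, d).
    return pooled.transpose(1, 0, 2).reshape(k, d), attn


rng = np.random.default_rng(0)
d, num_heads, k = 16, 4, 1
tokens = rng.normal(size=(50, d))  # stand-in for 50 token embeddings
seeds = rng.normal(size=(k, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

pooled, attn = pma_pool(tokens, seeds, w_q, w_k, w_v, num_heads)
print(pooled.shape, attn.shape)
```

Because the seeds attend over the tokens as a set, the pooled output does not depend on token order or count, which is what allows a fixed-size prediction head to sit on top of inputs of varying length.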
Abstract
Understanding how molecular changes caused by genetic variation drive disease risk is crucial for deciphering disease mechanisms. However, interpreting genome sequences is challenging because of the vast size of the human genome, and because the consequences of genetic variation manifest across a wide range of cells, tissues and scales, from the molecular to the whole-organism level. Here, we present Phenformer, a multi-scale genetic language model that learns to generate mechanistic hypotheses as to how differences in genome sequence lead to disease-relevant changes in expression across cell types and tissues, directly from DNA sequences of up to 88 million base pairs. Using whole genome sequencing data from more than 150,000 individuals, we show that Phenformer generates mechanistic hypotheses about disease-relevant cell and tissue types that match the literature better than existing state-of-the-art methods, while using only sequence data. Furthermore, disease risk predictors enriched by Phenformer show improved prediction performance and generalisation to diverse populations. Accurate multi-megabase scale interpretation of whole genomes without additional experimental data enables both a deeper understanding of the molecular mechanisms involved in disease and improved disease risk prediction at the level of individuals.
