Multi-megabase scale genome interpretation with genetic language models

Frederik Träuble, Lachlan Stuart, Andreas Georgiou, Pascal Notin, Arash Mehrjou, Ron Schwessinger, Mathieu Chevalley, Kim Branson, Bernhard Schölkopf, Cornelia van Duijn, Debora Marks, Patrick Schwab

TL;DR

Phenformer is a multi-scale genetic language model that interprets whole-genome sequences by linking DNA sequence to cell-context-specific expression and disease, processing up to $88$ million base pairs. It uses a frozen Enformer-based sequence-to-expression backbone to generate token embeddings, followed by a Transformer with Pooling by Multihead Attention to predict disease risk, and is trained on more than $150{,}000$ UK Biobank genomes. Phenformer identifies disease-associated cell types from sequence with better literature alignment than baselines, improves disease risk prediction across ancestries when ensembled with PRS methods (average AUROC gains of up to $4.2\%$ in mixed-ancestry and $11.19\%$ in non-European-ancestry populations), and reveals molecular subtypes with distinct comorbidity patterns. These findings demonstrate the feasibility and value of end-to-end, multi-megabase genome interpretation for mechanistic insight and personalized risk prediction, while acknowledging limitations in genome coverage and the need for ethical deployment considerations.
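The pooling step named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it keeps only the attention and residual parts of Pooling by Multihead Attention (layer normalisation and feed-forward layers are omitted), uses random matrices in place of learned weights, and uses a toy embedding width of 64 rather than the 3072 dimensions reported for each token. The function name `pma_pool` and all parameter names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pma_pool(tokens, seed, w_q, w_k, w_v, num_heads):
    """Pool a variable-length set of token embeddings into k vectors.

    tokens: (m, d) array, one row per sequence window
    seed:   (k, d) array of seed queries (learned in a real model)
    w_q, w_k, w_v: (d, d) projection matrices (learned in a real model)
    Simplification (assumption): no layer norm or feed-forward sublayers.
    """
    m, d = tokens.shape
    k = seed.shape[0]
    assert d % num_heads == 0, "embedding width must divide evenly into heads"
    head_dim = d // num_heads

    # Project, then split into heads: (num_heads, rows, head_dim).
    q = (seed @ w_q).reshape(k, num_heads, head_dim).transpose(1, 0, 2)
    key = (tokens @ w_k).reshape(m, num_heads, head_dim).transpose(1, 0, 2)
    val = (tokens @ w_v).reshape(m, num_heads, head_dim).transpose(1, 0, 2)

    # Each seed query attends over all m tokens within each head.
    scores = q @ key.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, k, m)
    attn = softmax(scores, axis=-1)

    # Weighted sum of values, then merge heads back to width d.
    pooled = (attn @ val).transpose(1, 0, 2).reshape(k, d)
    return seed + pooled, attn  # residual connection on the seed


rng = np.random.default_rng(0)
d, num_heads = 64, 4  # toy width for illustration only
seed = rng.normal(size=(1, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

tokens_a = rng.normal(size=(5, d))  # an input with 5 windows
tokens_b = rng.normal(size=(9, d))  # an input with 9 windows
pooled_a, attn_a = pma_pool(tokens_a, seed, w_q, w_k, w_v, num_heads)
pooled_b, attn_b = pma_pool(tokens_b, seed, w_q, w_k, w_v, num_heads)
pooled_perm, _ = pma_pool(tokens_b[::-1], seed, w_q, w_k, w_v, num_heads)

print(pooled_a.shape, pooled_b.shape)
print(np.allclose(pooled_b, pooled_perm))
```

Both inputs are reduced to the same fixed output shape regardless of how many windows they contain, and reversing the token order leaves the pooled vector unchanged. These two properties are what make this kind of pooling suitable for a variable number of sequence windows.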

Abstract

Understanding how molecular changes caused by genetic variation drive disease risk is crucial for deciphering disease mechanisms. However, interpreting genome sequences is challenging because of the vast size of the human genome, and because the consequences of genetic variation manifest across a wide range of cells, tissues and scales, spanning from the molecular to the whole-organism level. Here, we present Phenformer, a multi-scale genetic language model that learns to generate mechanistic hypotheses as to how differences in genome sequence lead to disease-relevant changes in expression across cell types and tissues directly from DNA sequences of up to 88 million base pairs. Using whole genome sequencing data from more than 150 000 individuals, we show that Phenformer generates mechanistic hypotheses about disease-relevant cell and tissue types that match literature better than existing state-of-the-art methods, while using only sequence data. Furthermore, disease risk predictors enriched by Phenformer show improved prediction performance and generalisation to diverse populations. Accurate multi-megabase scale interpretation of whole genomes without additional experimental data enables both a deeper understanding of molecular mechanisms involved in disease and improved disease risk prediction at the level of individuals.

Paper Structure (26 sections, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Phenformer is a genetic language model that learns to connect individual genomes to changes in cell-type-specific expression to disease directly from sequence. Phenformer is an end-to-end multi-scale model that directly processes genomes following the information flow in molecular biology [crick1970central]: ⟨sequence → cell context → expression → phenotype⟩. A variable number $m$ of 196 kilobase (kb) windows, each centred on the transcription start site (TSS) of a gene, are first transformed by a sequence-to-expression backbone (Enformer [avsec2021effective]) that was pretrained to predict expression and chromatin accessibility across a wide range of cell types. Tokens of sequence embeddings (3072 dimensions per TSS) are then passed to an expression-to-phenotype core that consists of multiple transformer encoder layers [vaswani2017attention], whose outputs are then aggregated across sequence embeddings using Pooling by Multihead Attention (PMA) [lee2019set]. A prediction head outputs individual risk predictions for the phenotype of interest. Phenformer integrates up to 88 million base pairs (almost 3% of an individual genome and an order of magnitude more than the largest existing genetic language model [nguyen2023hyenadna]) to highlight potential molecular mechanisms underlying diseases, predict disease risk, and identify disease subtypes.
  • Figure 2: Phenformer identifies disease-associated cell and tissue types. a. Phenformer independently recovers cell and tissue type to disease associations previously reported in literature as measured via F1 score through enrichment (at least 5% enrichment as a threshold for Phenformer). b. We compared Phenformer to state-of-the-art cell type identification methods that leverage genetic and/or single cell RNA sequencing (scRNAseq) data [ongen2017estimating, finucane2018heritability, watanabe2019genetic, jagadeesh2022identifying, amariuta2023modeling] and found that Phenformer more accurately identified the cell types reported in literature to be associated with disease by average F1 score (dots represent per-disease differences). For fairness, the comparison was conducted in pairwise fashion on the overlap of diseases and cell types for which predictions were available for both Phenformer and the method being compared to. c. An overview of categories of cell types highlighted by Phenformer to be enriched in differential disease risk predictions (top) and, for comparison, an overview of the cell type-disease associations supported by scientific literature (bottom). Larger circles indicate that more members of the respective category of cell type were ranked highly by Phenformer (\ref{fig:interpretation}) or scientific literature (see Section "\ref{par:methods_literature}" for methodology), respectively. Grey circles indicate that at least one member of the cell type category was ranked in the top 30 most predicted differential cell types for a disease for Phenformer or that 5 or more abstracts scoring highly for evidence of association between the cell type and disease were found in literature. Cell types were assigned to the most specific category shown, i.e. mast cells were not also part of the myeloid cells category.
  • Figure 3: Phenformer improves prediction of disease risk from whole genomes. We used ensembles of Phenformer (trained on approximately 3% of the whole genome) and state-of-the-art polygenic risk score (PRS) methods (Lassosum, LDpred2, PRS-CSx, Pthres, C+T) to improve risk prediction performance across 6 major diseases (psoriasis, type 1 diabetes, type 2 diabetes, diabetic retinopathy, chronic obstructive pulmonary disease [COPD] and hypothyroidism) on held-out test set individuals with a. mixed ancestry and b. non-European ancestry. We found that enhancing PRS methods with Phenformer predictions significantly (p $\leq 0.05$; Mann-Whitney Wilcoxon test for superiority) improves disease risk prediction compared to predicting risk using only the ensemble partner for 86.7% and 96.7% of diseases and ensemble partners, with average performance benefits across diseases of up to 4.2% and 11.19% higher area under the receiver operating characteristic curve (AUROC) in populations of mixed ancestry and non-European ancestry, respectively. When restricting the evaluation to the same subset of approximately 3% of the genome sequence that Phenformer was trained on (corresponding to sequence windows around 512 genes), Phenformer achieves up to 5.49% and 14.59% higher prediction performance in terms of average AUROC across diseases for populations of c. mixed ancestry and d. non-European ancestry, respectively. Uncertainty was evaluated using bootstrap resampling with 2000 samples.
  • Figure 4: Phenformer provides cell type rankings for sequence windows associated with the liver in psoriasis and the small intestine in T1D. Phenformer attributions highlight the sequence windows around the TSSs of SELENOW (top left) and SPX (top right) as potentially relevant for differential expression changes in liver and hepatocyte cellular contexts in psoriasis-affected individuals (top row), and CYP7A1 (bottom left) and GIMD1 (bottom right) as potentially relevant in the small intestine in T1D-affected individuals (bottom row). We note that the 196 kb sequence windows of SELENOW (CRX, EHD2, NOP53, TPRX1, TPRX2), SPX (GOLT1B, GYS2, PYROXD1, RECQL), CYP7A1 (SDCBP, UBXN2B) and GIMD1 (AIMP1, TBCK) overlap with multiple other genes (listed in parentheses), which may partially or fully explain the importance assigned to the respective sequence windows (see Section "\ref{par:interpretation_meaning}" for additional guidance on interpretation). The ability of Phenformer to highlight cell and tissue contexts of importance for particular gene sequence windows may provide hypotheses that help substantiate disease-associated pathologies that are known but not yet molecularly understood, such as the increased frequency and severity of non-alcoholic fatty liver disease (NAFLD) in psoriasis patients [prussick2015nonalcoholic] and changes in cholesterol synthesis and absorption markers in T1D patients [semova2019type].
  • Figure 5: Phenformer embeddings enable grouping of individuals by their underlying differences in disease-related molecular mechanisms. Latent space embeddings of Phenformer can be used to subtype individuals according to their differences in molecular processes induced by genetic variation, enabling a fine-grained understanding of molecular subtypes in broader disease categories. Circles and plus (+) symbols represent diagnosed individuals and an equal number of undiagnosed reference individuals (not used for clustering), respectively. Using Phenformer trained to predict psoriasis (top) and diabetic retinopathy (bottom; visualised using UMAP [mcinnes2018umap]), we identified molecular subtypes (colors with associated cluster labels). Molecular subtypes were associated with differences in co-morbidity rates (pie chart insets) among diagnosed cluster members (highlighted for clusters with the largest differences). We find statistically significant (* = p $\leq 0.05$; $\chi^2$ test) differences in dermatitis, seborrheic dermatitis and T1D comorbidity rates in psoriasis subtypes, and in dermatitis in diabetic retinopathy subtypes, suggesting differences in underlying molecular processes identified by the Phenformer embeddings of individual genomes. Subtype differences in T1D ($p=0.0684$) and ulcerative colitis ($p=0.1374$) in diabetic retinopathy do not reach significance (n.s.).
  • ...and 7 more figures
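
The Figure 3 caption reports AUROC gains from ensembling with bootstrap resampling over 2000 samples. The sketch below shows one way such an evaluation could be set up, and rests on several assumptions: the data are synthetic, the ensemble is a plain average of z-scored predictions (the source does not state the ensembling rule), and uncertainty is summarised as a percentile interval rather than with the Mann-Whitney Wilcoxon test named in the caption. The names `auroc`, `zscore` and `bootstrap_auroc_gain` are hypothetical.

```python
import numpy as np


def auroc(labels, scores):
    """AUROC as the fraction of (case, control) pairs ranked correctly.

    Ties between a case and a control count as half a correct pair.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUROC needs at least one case and one control")
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def zscore(x):
    """Standardise predictions so two predictors share a common scale."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def bootstrap_auroc_gain(labels, base, ensemble, n_boot=2000, seed=0):
    """Mean and 95% percentile interval of AUROC(ensemble) - AUROC(base)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=bool)
    base = np.asarray(base, dtype=float)
    ensemble = np.asarray(ensemble, dtype=float)
    n = len(labels)
    gains = []
    while len(gains) < n_boot:
        idx = rng.integers(0, n, size=n)  # resample individuals with replacement
        if labels[idx].all() or not labels[idx].any():
            continue  # skip resamples that contain only one class
        gains.append(auroc(labels[idx], ensemble[idx]) - auroc(labels[idx], base[idx]))
    lo, hi = np.percentile(gains, [2.5, 97.5])
    return float(np.mean(gains)), float(lo), float(hi)


# Synthetic example: two predictors that each capture part of the risk signal.
rng = np.random.default_rng(42)
n = 400
signal_a, signal_b = rng.normal(size=n), rng.normal(size=n)
labels = (signal_a + signal_b + rng.normal(size=n)) > 1.0
prs_score = signal_a + 0.5 * rng.normal(size=n)    # stands in for a PRS method
model_score = signal_b + 0.5 * rng.normal(size=n)  # stands in for a second model
ensemble_score = (zscore(prs_score) + zscore(model_score)) / 2

mean_gain, ci_lo, ci_hi = bootstrap_auroc_gain(labels, prs_score, ensemble_score)
print(f"AUROC gain: {mean_gain:.3f} (95% interval {ci_lo:.3f} to {ci_hi:.3f})")
```

Resampling individuals rather than scores keeps each person's label and both predictions together, so the interval reflects the paired difference between the ensemble and its partner on the same test set.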