DYNA: Disease-Specific Language Model for Variant Pathogenicity

Huixin Zhan; Zijun Zhang

TL;DR

DYNA addresses the lack of disease specificity in unsupervised variant effect predictors (VEPs) by fine-tuning genomic foundation models with a Siamese architecture to capture disease-context signals. It introduces two losses, a pseudo-log-likelihood ratio ($PLLR$) loss for coding VEPs and a contrastive loss for non-coding VEPs, enabling effective learning from small, rare variant sets. Across cardiovascular coding variants and splicing-related non-coding variants, DYNA achieves superior intra-gene and unseen-gene generalization, with strong ClinVar replication and improved non-coding VEP performance on MFASS, including zero-shot and few-shot transfer. These results point to a practical path toward disease-tailored clinical variant interpretation that can inform precision medicine for cardiomyopathies (CM), arrhythmias (ARM), and splicing-related diseases.
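To make the PLLR idea concrete, here is a minimal sketch under stated assumptions. It assumes a masked language model has already produced per-position log-probabilities for the wild-type and the mutated sequence; the function names, the toy arrays, and the sign convention (mutant minus wild type, no absolute value) are illustrative choices and are not taken from the paper.

```python
import numpy as np

def pseudo_log_likelihood(log_probs, token_ids):
    """Sum over positions of log P(x_i = s_i | s).

    log_probs: (L, V) array of per-position log-probabilities
               (hypothetical model output, not a real model call).
    token_ids: length-L integer array with the observed residue at each position.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    token_ids = np.asarray(token_ids, dtype=int)
    return float(log_probs[np.arange(len(token_ids)), token_ids].sum())

def pllr(wt_log_probs, wt_ids, mut_log_probs, mut_ids):
    """Pseudo-log-likelihood ratio; the sign convention is an assumption."""
    return (pseudo_log_likelihood(mut_log_probs, mut_ids)
            - pseudo_log_likelihood(wt_log_probs, wt_ids))

# Toy example: sequence length L = 2, vocabulary size V = 3.
toy_log_probs = np.log([[0.5, 0.25, 0.25],
                        [0.25, 0.5, 0.25]])
wt_ids = [0, 1]    # wild-type residues
mut_ids = [0, 2]   # single substitution at the second position
score = pllr(toy_log_probs, wt_ids, toy_log_probs, mut_ids)
print(score)
```

In this sketch a more negative score means the model finds the mutated sequence less plausible than the wild type. In practice the two log-probability matrices would come from separate forward passes on the wild-type and mutated sequences.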

Abstract

Clinical variant classification of pathogenic versus benign genetic variants remains a challenge in clinical genetics. Recently, genomic foundation models have improved generic variant effect prediction (VEP) accuracy via weakly-supervised or unsupervised training. However, these VEPs are not disease-specific, limiting their adaptation at the point of care. To address this problem, we propose DYNA: Disease-specificity fine-tuning via a Siamese neural network, broadly applicable to all genomic foundation models, for more effective variant effect predictions in disease-specific contexts. We evaluate DYNA on two distinct disease-relevant tasks. For coding VEPs, we focus on various cardiovascular diseases, where gene-disease relationships of loss-of-function vs. gain-of-function dictate disease-specific VEP. For non-coding VEPs, we apply DYNA to an essential post-transcriptional regulatory axis, RNA splicing, the most common non-coding pathogenic mechanism in established clinical VEP guidelines. In both cases, DYNA fine-tunes various pre-trained genomic foundation models on small, rare variant sets. The DYNA fine-tuned models show superior performance on the held-out rare variant test set, and the results are further replicated in large, clinically relevant variant annotations in ClinVar. Thus, DYNA offers a potent disease-specific variant effect prediction method that excels in intra-gene generalization and in generalization to unseen genetic variants, making it particularly valuable for disease associations and clinical applicability.
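The Siamese setup described above compares two sequence embeddings produced by a shared encoder. The sketch below shows the classic margin-based contrastive loss for such pairs. It is not necessarily the exact loss used in DYNA: the pairing (a wild-type embedding with a variant embedding), the labelling (benign pairs treated as "similar", pathogenic pairs as "dissimilar"), the margin value, and all names are assumptions made for illustration.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, similar, margin=1.0):
    """Classic margin-based contrastive loss over a batch of embedding pairs.

    emb_a, emb_b: (N, D) arrays, e.g. wild-type and variant embeddings
                  from a shared (Siamese) encoder; hypothetical inputs.
    similar:      length-N array, 1 if the pair should be close, 0 otherwise.
    margin:       dissimilar pairs are pushed apart until they are at least
                  this far from each other.
    """
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    similar = np.asarray(similar, dtype=float)
    dist = np.linalg.norm(emb_a - emb_b, axis=1)       # Euclidean distance per pair
    pull = similar * dist ** 2                          # penalise distant similar pairs
    push = (1.0 - similar) * np.maximum(0.0, margin - dist) ** 2  # penalise close dissimilar pairs
    return float(np.mean(pull + push))

# Toy batch: one "similar" pair at distance 1.0, one "dissimilar" pair at distance 0.5.
wt_emb = np.zeros((2, 2))
var_emb = np.array([[0.6, 0.8], [0.3, 0.4]])
labels = np.array([1, 0])
loss = contrastive_loss(wt_emb, var_emb, labels, margin=1.0)
print(loss)
```

With this formulation, the distance between the two embeddings can itself serve as a variant score after training, which is one common way to use a Siamese model at inference time.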

Paper Structure (29 sections, 8 equations, 16 figures, 4 tables)

Main
Results
Overview of the DYNA Framework
Assessing DYNA's intra-gene generalization ability on Inherited CM and ARM
DYNA effectively identifies pathogenic and benign rare missense genetic variants over ESM1b.
DYNA outperforms baseline methods on cardiovascular diseases.
Replication of DYNA in ClinVar.
DYNA generalizes to unseen disease-relevant genes.
Assessing DYNA's Generalization Ability for Non-Coding VEPs on MFASS
DYNA outperforms other genomic foundation models.
DYNA shows generalization ability to unseen clinically-relevant splicing non-coding VEPs.
DYNA improves the accuracy of non-coding VEPs on splicing-related diseases.
Discussion
Methods
Pseudo-Log-Likelihood Ratio for Coding VEPs
...and 14 more sections

Figures (16)

Figure 1: a In genomic foundation models, we analyze two primary types of biological inputs: protein sequences, representing the coding regions of the genome (approximately 1.5%), and DNA sequences, corresponding to the non-coding regions. b DYNA incorporates a Siamese network to enhance the analysis of genomic sequences through similarity comparison. c Illustration of PLLR computation in DYNA for a pair of wild-type and mutated sequences. d Distributions of PLLR values for benign and pathogenic sequences under the ESM1b and DYNA models on cardiomyopathies, showing the difference in model performance. e Attribution matrix that visualizes the PLLR scores for all $V \times L$ possible missense variants, where $V$ denotes the vocabulary size and $L$ denotes the protein length. The attribution matrix sequentially displays the PLLR values from the ESM1b model, the DYNA model, and their differences, with cytoplasmic domains marked in red and non-cytoplasmic domains in blue.
Figure 2: a The distribution of PLLR values for benign and pathogenic sequences on CM, where a one-sided p-value test confirmed significant differentiation between the pathogenic and benign sequences. b Similarly, we evaluated the PLLR distribution for benign and pathogenic sequences within ARM. c Distributions of PLLR values for benign and pathogenic sequences under the ESM1b and DYNA models on CM, showing the difference in model performance. d PLLR distributions under the ESM1b and DYNA models on ARM. e AUPR performance on CM for DYNA and baselines. f AUPR performance on ARM for DYNA and baselines.
Figure 3: a Comparison of AUC scores between DYNA and ESM1b for non-overlapping genes with diverse mutation positions in the ClinVar CM dataset. The radar chart illustrates DYNA's significantly higher AUC, achieving scores nine times greater than those of ESM1b, along with a detailed representation of the range where DYNA shows a sevenfold improvement in the lower AUC range. b Performance comparison using AUPR for DYNA and ESM1b, highlighting the results on non-overlapping genes within the ClinVar CM dataset. c Kernel density estimate (KDE) plots of the pseudo-log-likelihood ratio (PLLR) for overlapping and non-overlapping gene mutations in the ClinVar ARM dataset. d AUC and AUPR scores (with 1000 bootstrap resamples) for DYNA compared to ESM1b across the ClinVar CM and ARM datasets. These panels detail the performance metrics for weighted PLLR evaluation, demonstrating DYNA's superior generalization capabilities across all groups. Enhanced performance is particularly noted in non-overlapping genes, validating the model's strength in generalization to unseen disease-relevant genes compared to intra-gene generalization.
Figure 4: a Comparative analysis of MFASS AUPR performance for genomic foundation models and conventional metrics. b Comparison of AUC with and without SDVs. c Comparison of AUPR with and without SDVs. d AUC comparison between the GPN model fine-tuned with MFASS and zero-shot ClinVar Splicing performance. e AUPR comparison between the GPN model fine-tuned with MFASS and zero-shot ClinVar Splicing performance. f AUC results for the DYNA model after fine-tuning only a classification head on the ClinVar Splicing dataset. This panel shows the AUC after a five-shot fine-tuning phase exclusively on the ClinVar Splicing dataset, following the initial fine-tuning on the MFASS dataset using the GPN model. The AUC achieved is 0.95, demonstrating the model's effective use of prior non-coding VEP knowledge within the DYNA framework and its strong out-of-distribution generalization capabilities. g AUPR results for DYNA after five-shot fine-tuning on the ClinVar Splicing dataset. h AUC results per disease within the ClinVar Splicing dataset. This panel displays the AUC for four specific diseases, each with more than five positive and negative variants, highlighting the performance improvements achieved by DYNA, fine-tuned using the GPN model, over the baseline GPN model. Retinitis Pigmentosa, Breast and/or Ovarian Cancer, Seizure Disorders, and Hypertrophic Cardiomyopathy are shown due to their known splicing-related pathologies.
Figure 5: AUC and AUPR performance on CM and ARM for ESM2 and DYNA fine-tuned ESM2.
...and 11 more figures
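
The Figure 3 caption reports AUC and AUPR with 1000 bootstrap resamples. A minimal sketch of how such a bootstrap estimate can be computed for AUC is given below; the resampling scheme (sampling variants with replacement and skipping single-class resamples), the percentile interval, and the function names are assumptions for illustration, not the paper's evaluation code. The same loop would apply to AUPR by swapping the metric function.

```python
import numpy as np

def auc_score(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]   # all positive-vs-negative score differences
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def bootstrap_auc(labels, scores, n_boot=1000, seed=0):
    """Mean AUC and a 95% percentile interval over bootstrap resamples."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)          # resample variants with replacement
        if labels[idx].all() or not labels[idx].any():
            continue                              # AUC is undefined without both classes
        stats.append(auc_score(labels[idx], scores[idx]))
    stats = np.array(stats)
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return float(stats.mean()), (float(lo), float(hi))

# Toy data: 1 = pathogenic, 0 = benign; scores are hypothetical model outputs.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
point_auc = auc_score(toy_labels, toy_scores)
mean_auc, (ci_lo, ci_hi) = bootstrap_auc(toy_labels, toy_scores, n_boot=200)
print(point_auc, mean_auc, ci_lo, ci_hi)
```

The pairwise AUC computation is quadratic in the number of variants, which is acceptable for small rare-variant sets; a rank-based implementation would be preferable for large datasets.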
