Integrating Large Language Models for Genetic Variant Classification

Youssef Boulaimen, Gabriele Fossi, Leila Outemzabet, Nathalie Jeanray, Oleksandr Levenets, Stephane Gerart, Sebastien Vachenc, Salvatore Raieli, Joanna Giemza

TL;DR

This study investigates the integration of state-of-the-art LLMs, including GPN-MSA, ESM1b, and AlphaMissense, which leverage DNA and protein sequence data alongside structural insights to form a comprehensive analytical framework for variant classification, and evaluates these integrated models using the well-annotated ProteinGym and ClinVar datasets.

Abstract

The classification of genetic variants, particularly Variants of Uncertain Significance (VUS), poses a significant challenge in clinical genetics and precision medicine. Large Language Models (LLMs) have emerged as transformative tools in this realm. These models can uncover intricate patterns that traditional methods might miss, thus improving the accuracy of variant pathogenicity prediction. This study investigates the integration of state-of-the-art LLMs, including GPN-MSA, ESM1b, and AlphaMissense, which leverage DNA and protein sequence data alongside structural insights to form a comprehensive analytical framework for variant classification. Our approach evaluates these integrated models using the well-annotated ProteinGym and ClinVar datasets, setting new benchmarks in classification performance. The models were rigorously tested on a set of challenging variants, demonstrating substantial improvements over existing state-of-the-art tools, especially in handling ambiguous and clinically uncertain variants. The results underline the efficacy of combining multiple modeling approaches to improve the accuracy and reliability of genetic variant classification systems. These findings support the deployment of these computational models in clinical environments, where they can enhance the diagnostic processes for genetic disorders and advance personalized medicine by offering more detailed and actionable genetic insights.

Paper Structure

This paper contains 20 sections, 7 figures.

Figures (7)

  • Figure 1: Dataset presentation: a. Dataset Samples: This table provides a representative sample of the dataset used in this study, showing both observed and potential mutation scores from three models: GPN, ESM, and AlphaMissense. Each entry is the score that the respective model assigns to a possible nucleotide change (A, T, G, C) or amino acid substitution (e.g., Ala, Cys, Tyr); the observed scores appear in the columns Alt_score, Prot_Alt_score, and Am_pathogenicity. Nucleotides and amino acids with a score of 0 correspond to the reference allele or protein. The "DMS_bin_score" column indicates the clinical classification of the mutation as either "pathogenic" or "benign." b. Dataset Visualization: This plot shows the distribution of the variants’ observed scores, with the GPN-MSA score (Alt_score) on the X-axis and the ESM1b score (Prot_Alt_score) on the Y-axis. Red points are variants classified as pathogenic by DMS_bin_score; black points are variants classified as benign. c. Optimal Threshold for GPN: This ROC curve illustrates the performance of the GPN model in discriminating between pathogenic and benign genetic variants. The x-axis is the False Positive Rate (FPR) and the y-axis is the True Positive Rate (TPR) across threshold levels. The orange line is the ROC curve itself, showing how TPR and FPR change as the threshold varies. The area under the curve (AUC) is 0.70, indicating the model's overall ability to distinguish between the classes; a value of 1.0 represents a perfect classifier, and a value of 0.5 represents a random guess. The dashed blue line is the line of no discrimination, which serves as a baseline. The optimal classification threshold is the one that maximizes the difference between TPR and FPR (Youden's J statistic).
  • Figure 2: Correlation Analysis: This table shows the correlation matrix between the observed scores from the GPN-MSA, ESM1b, and AlphaMissense models. The correlation coefficients quantify the degree to which these models agree or disagree on the pathogenic potential of the mutations, providing insight into their comparative analytical behaviors.
  • Figure 3: Comparative performance of Machine Learning models in genetic variant classification: This table presents the benchmarking results of different machine learning models using various combinations of features derived from GPN-MSA, ESM1b, and AlphaMissense. The models evaluated include multi-input neural networks, single-input neural networks, XGBoost, and Random Forest, each tested across four distinct feature sets: GPN+ESM potential scores, GPN+ESM+AlphaMissense potential scores, observed scores from GPN+ESM+AlphaMissense, and a combination of observed and potential scores from GPN+ESM+AlphaMissense.
  • Figure 4: Model Performance Across Different Conditions and Datasets: (a): This panel displays the accuracy of individual models on the DMS_bin_score. The bars represent the accuracy of the AlphaMissense model, the integrated Model (combining GPN-MSA, ESM1b, and AlphaMissense), ESM1b, and GPN. The integrated Model performs best with an accuracy of 82.54%, followed by AlphaMissense at 74.58%. ESM1b and GPN have lower accuracies at 73.84% and 67.03% respectively. (b): This panel shows model accuracy after removing variants classified as ambiguous by AlphaMissense, indicating how the clarity of variant classification affects model performance. The integrated Model achieves the highest accuracy at 84.51%, followed by AlphaMissense at 83.51% and ESM1b at 75.41%. GPN remains the least accurate at 69.26%, gaining only about 2 percentage points from the removal of ambiguous variants, compared with about 9 for AlphaMissense. (c): This panel focuses on the accuracy of models in predicting variants that AlphaMissense classifies as ambiguous, highlighting the challenges in handling ambiguous genomic data. The integrated Model maintains the highest accuracy at 66.07%, demonstrating its robustness even in uncertain conditions. ESM1b and GPN show reduced accuracies at 60.69% and 48.38% respectively.
  • Figure 5: Model Performance on ClinVar Dataset: (a): This panel displays the accuracy of AlphaMissense, the integrated Model, ESM1b, and GPN when tested against the ClinVar dataset. The ClinVar dataset provides three classes: Pathogenic, Benign, and Ambiguous. The integrated Model shows the highest accuracy at 79.16%, followed by AlphaMissense at 72.07%, ESM1b at 70.93%, and GPN with the lowest accuracy at 64.33%. (b): This panel shows the accuracy of the same models on the ClinVar dataset after the removal of the 715 ambiguous variants. All models improve. The integrated Model leads at 82.82%, demonstrating its effectiveness in classifying clearly defined genetic variants. It is followed by AlphaMissense at 74.87%, ESM1b at 74.21%, and GPN at 67.31%. (c): This panel focuses on variants classified as ambiguous in the ClinVar dataset, using DMS as ground truth. The integrated Model maintains superior performance even in this subset, achieving an accuracy of 76.22%. AlphaMissense follows at 68.11%, ESM1b at 65.59%, and GPN has the lowest accuracy at 60.70%. (d): This panel examines the accuracy of the models on the subset of variants classified as ambiguous by both ClinVar and AlphaMissense, with DMS used as ground truth. The integrated Model again performs best in this challenging scenario with an accuracy of 60.24%, followed by ESM1b at 55.42% and GPN at 46.99%, further supporting the robustness of the integrated Model.
  • ...and 2 more figures
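
The threshold rule mentioned in the Figure 1c caption (choose the cut-off that maximizes TPR minus FPR) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `youden_threshold`, the toy scores and labels, and the convention that a higher score means "more likely pathogenic" are all hypothetical. For a model whose scores run in the opposite direction, the scores would need to be negated first.

```python
def youden_threshold(scores, labels):
    """Return (threshold, tpr, fpr) that maximizes TPR - FPR.

    labels: 1 = pathogenic, 0 = benign (assumed encoding).
    A variant is called pathogenic when score >= threshold (assumed direction).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one variant of each class")

    best = None  # (j, threshold, tpr, fpr)
    # Every distinct score is a candidate cut-off.
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        tpr = tp / positives
        fpr = fp / negatives
        j = tpr - fpr  # the quantity being maximized
        if best is None or j > best[0]:
            best = (j, threshold, tpr, fpr)
    return best[1], best[2], best[3]


# Toy data, invented purely for illustration.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 1, 0, 1, 1]

threshold, tpr, fpr = youden_threshold(toy_scores, toy_labels)
print(threshold, tpr, fpr)
```

On this toy data the selected cut-off is 0.4, where all four pathogenic variants are recovered at the cost of one benign variant being misclassified. The quadratic loop is kept for readability; a sorted single pass would be preferable on a dataset of realistic size.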