Predicting loss-of-function impact of genetic mutations: a machine learning approach

Arshmeet Kaur; Morteza Sarmadi

TL;DR

This study addresses the need for fast assessment of gene sensitivity to loss-of-function mutations by predicting LoFtool scores from mutation-level attributes. It leverages a ClinVar-derived open dataset, applies extensive preprocessing, and evaluates a suite of machine learning models with univariate feature selection, using five-fold cross-validation and multiple error metrics. Random Forest and XGBoost emerge as the top performers with $R^2$ around $0.97$, while KNN and SVR struggle unless highly correlated features are removed; RANSAC shows robustness to outliers, and regularized target encoding handles high-cardinality features effectively. The results demonstrate the potential to rapidly estimate LoFtool scores from genomic attributes, enabling faster variant prioritization and guiding future work on generalizability with larger, more diverse datasets.

Abstract

The development of next-generation sequencing (NGS) techniques has significantly reduced the cost of genome sequencing, lowering barriers to future medical research; it is now feasible to apply genome sequencing in studies where it would previously have been cost-prohibitive. Identifying damaging or pathogenic mutations in vast amounts of complex, high-dimensional genome sequencing data is of particular interest to researchers. This paper therefore aimed to train machine learning models on the attributes of a genetic mutation to predict LoFtool scores (which measure a gene's intolerance to loss-of-function mutations). These attributes included, but were not limited to, the position of a mutation on a chromosome, changes in amino acids, and changes in codons caused by the mutation. Models were built using the univariate feature selection technique f-regression combined with K-nearest neighbors (KNN), Support Vector Machine (SVM), Random Sample Consensus (RANSAC), Decision Trees, Random Forest, and Extreme Gradient Boosting (XGBoost). The models were evaluated using five-fold cross-validated averages of r-squared, mean squared error, root mean squared error, mean absolute error, and explained variance. Multiple trained models reached testing-set r-squared values of 0.97.
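As an illustration of the evaluation protocol named in the abstract, the following is a minimal sketch of how five-fold cross-validated averages of r-squared, mean squared error, root mean squared error, mean absolute error, and explained variance could be computed. It is not the authors' code: the function names (`regression_metrics`, `five_fold_average`, `fit_line`) are hypothetical, the one-feature least-squares model is only a stand-in for the models listed above, and the data in the usage lines is synthetic.

```python
import math
import random


def regression_metrics(y_true, y_pred):
    """Return the five error metrics named in the abstract for one fold."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mean_res = sum(residuals) / n
    var_res = sum((r - mean_res) ** 2 for r in residuals) / n
    mse = ss_res / n
    return {
        "r2": 1 - ss_res / ss_tot,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(r) for r in residuals) / n,
        # Unlike r2, explained variance ignores a constant bias in the residuals.
        "explained_variance": 1 - var_res / (ss_tot / n),
    }


def five_fold_average(xs, ys, fit, k=5, seed=0):
    """Average each metric over k held-out folds (k=5 as in the abstract)."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    averages = {}
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # `fit` returns a prediction function trained on the other k-1 folds.
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        scores = regression_metrics(
            [ys[i] for i in held_out], [predict(xs[i]) for i in held_out]
        )
        for name, value in scores.items():
            averages[name] = averages.get(name, 0.0) + value / k
    return averages


def fit_line(xs, ys):
    """Hypothetical stand-in model: ordinary least squares on one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


# Synthetic usage: a noiseless linear target, so every fold is fit exactly.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
cv_scores = five_fold_average(xs, ys, fit_line)
```

Reporting r-squared and explained variance side by side is informative because the two differ only when predictions carry a systematic offset: a model that is consistently too high or too low keeps a high explained variance but loses r-squared.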

Paper Structure (10 sections, 2 figures, 4 tables)

Introduction
Methods
Original Dataset
Data Preprocessing
Addressing Missing Values and Encoding Categorical Variables
Visualizing Relationships Within the Final Dataset
Visualizing Skew and Transforming Data
Feature Selection
Model Selection
Results and Discussion

Figures (2)

Figure 1: Correlation Matrix: POS, cDNA position, CDS position, Protein position, CHROM, SYMBOL, Feature, and EXON are correlated with LoFtool. Several of these variables were also highly correlated with each other (e.g. cDNA position, CDS position, and Protein position). These variables were treated as candidates to drop or retain when testing machine learning models. More details are given in Tables II, III, and IV.
Figure 2: Distribution of continuous variables before and after applying transformations. As can be seen in the plots, there were still many outliers left after both transformations. The log transformation normalized allele frequency columns more than the Yeo-Johnson transformation.
