Machine Learning Algorithm for Noise Reduction and Disease-Causing Gene Feature Extraction in Gene Sequencing Data

Weichen Si, Yihao Ou, Zhen Tian

TL;DR

This work tackles the challenge of sequencing noise impairing genetic-disease analysis by introducing DeepSeqDenoise, a dual-encoder CNN–RNN noise-reduction framework, and a subsequent feature-engineering–driven model to predict pathogenic genes. The approach yields a 9.4 dB SNR improvement and a 94.3% accuracy in disease-gene prediction, validated across cardiovascular disease cohorts and independent datasets, with demonstrated clinical applicability including discovery of novel pathogenic candidates. Key contributions include a 17-feature subset for robust pathogenicity prediction, an integrated XGBoost–RF–DNN ensemble with SMOTE and Bayesian-optimized weights, and transfer learning enhancements that boost generalization to 94.8% accuracy, particularly for rare variants. Collectively, the results show strong potential for improving genetic-disease diagnosis and enabling real-time, precision medicine applications across diverse sequencing platforms.
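As a rough illustration of how a weighted ensemble of the kind mentioned above can combine its members, the sketch below averages per-model pathogenicity probabilities with fixed weights. It assumes soft voting, which this summary does not confirm. The function name `weighted_soft_vote`, the probabilities, and the weights are all hypothetical placeholders; in the paper the weights are reported to come from Bayesian optimization.

```python
def weighted_soft_vote(probabilities, weights):
    """Combine per-model probabilities into one score via a weighted average.

    Both arguments are dicts keyed by model name. Weights are normalized,
    so they do not need to sum to 1.
    """
    total_weight = sum(weights[name] for name in probabilities)
    weighted_sum = sum(weights[name] * p for name, p in probabilities.items())
    return weighted_sum / total_weight


# Hypothetical outputs of the three base models for a single variant.
probs = {"xgboost": 0.9, "random_forest": 0.7, "dnn": 0.8}

# Hypothetical weights; placeholders for values found by Bayesian optimization.
weights = {"xgboost": 0.5, "random_forest": 0.2, "dnn": 0.3}

score = weighted_soft_vote(probs, weights)
is_pathogenic = score >= 0.5  # assumed decision threshold
```

With these placeholder numbers the combined score is 0.83, so the variant would be called pathogenic at a 0.5 threshold.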

Abstract

In this study, we propose a machine learning-based method for noise reduction and disease-causing gene feature extraction in gene sequencing data. The DeepSeqDenoise algorithm combines a CNN and an RNN to remove sequencing noise, improving the signal-to-noise ratio by 9.4 dB. We selected 17 key features through feature engineering and constructed an ensemble learning model that predicts disease-causing genes with 94.3% accuracy. In validation on a cardiovascular disease cohort, we identified 57 new candidate disease-causing genes, and in clinical application the method detected 3 variants that had been missed. The method significantly outperforms existing tools and provides strong support for accurate diagnosis of genetic diseases.
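To put the reported 9.4 dB figure in context, the sketch below computes signal-to-noise ratio in decibels using the standard definition, 10·log10(signal power / noise power), on a toy signal. The helper `snr_db` and the sample values are illustrative assumptions and are not taken from the paper, which may define SNR over sequencing signals differently.

```python
import math


def snr_db(clean, observed):
    """SNR in dB of `observed`, measured against the reference `clean` signal."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((o - c) ** 2 for c, o in zip(clean, observed))
    return 10.0 * math.log10(signal_power / noise_power)


# Toy data (hypothetical): a reference signal, a noisy copy, a denoised copy.
clean = [1.0, -1.0, 1.0, -1.0]
noisy = [1.5, -0.5, 0.5, -1.5]      # error of 0.5 per sample
denoised = [1.1, -0.9, 0.9, -1.1]   # error of 0.1 per sample

improvement = snr_db(clean, denoised) - snr_db(clean, noisy)

# A gain of G dB corresponds to a noise-power reduction factor of 10**(G/10)
# when the signal power is unchanged.
power_ratio_for_9_4_db = 10 ** (9.4 / 10)
```

Under this definition, a 9.4 dB improvement at constant signal power means the residual noise power is lower by a factor of about 8.7.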

Paper Structure

This paper contains 17 sections, 4 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Comparison of quality score distribution before and after quality control of sequencing data
  • Figure 2: Architecture of the DeepSeqDenoise algorithm
  • Figure 3: Comparison of the recovery rate of noise reduction algorithms under different noise levels
  • Figure 4: Performance of the prediction model on different types of genetic variants
  • Figure 5: Comparison of the performance of different pathogenic gene prediction methods
  • ...and 1 more figure