Machine Learning Algorithm for Noise Reduction and Disease-Causing Gene Feature Extraction in Gene Sequencing Data
Weichen Si, Yihao Ou, Zhen Tian
TL;DR
This work tackles the challenge of sequencing noise impairing genetic-disease analysis by introducing DeepSeqDenoise, a dual-encoder CNN–RNN noise-reduction framework, and a subsequent feature-engineering–driven model to predict pathogenic genes. The approach yields a 9.4 dB SNR improvement and 94.3% accuracy in disease-gene prediction, validated across cardiovascular disease cohorts and independent datasets, with demonstrated clinical applicability including discovery of novel pathogenic candidates. Key contributions include a 17-feature subset for robust pathogenicity prediction, an integrated XGBoost–RF–DNN ensemble with SMOTE and Bayesian-optimized weights, and transfer learning enhancements that boost generalization to 94.8% accuracy, particularly for rare variants. Collectively, the results show strong potential for improving genetic-disease diagnosis and enabling real-time precision-medicine applications across diverse sequencing platforms.
Abstract
In this study, we propose a machine learning-based method for noise reduction and disease-causing gene feature extraction in gene sequencing data. The DeepSeqDenoise algorithm combines a CNN and an RNN to remove sequencing noise, improving the signal-to-noise ratio by 9.4 dB. We selected 17 key features through feature engineering and constructed an ensemble learning model that predicts disease-causing genes with 94.3% accuracy. In validation on a cardiovascular disease cohort, we identified 57 new candidate disease-causing genes, and in clinical applications the method detected 3 missed variants. The method significantly outperforms existing tools and provides strong support for accurate diagnosis of genetic diseases.
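To make the reported signal-to-noise figure concrete, the following is a minimal sketch of how an SNR improvement in decibels is commonly computed. It assumes the standard power-ratio definition, SNR = 10·log10(signal power / noise power), measured against a known clean reference; the abstract does not state which definition was used, and the function names and toy signals here are hypothetical. Under this definition, a 9.4 dB gain at fixed signal power corresponds to roughly an 8.7-fold reduction in noise power.

```python
import math

def snr_db(reference, observed):
    """SNR in dB of `observed` relative to a clean `reference` signal.

    Assumes the common power-ratio definition; noise is observed - reference.
    """
    signal_power = sum(r * r for r in reference)
    noise_power = sum((o - r) ** 2 for r, o in zip(reference, observed))
    if noise_power == 0:
        return math.inf  # observed matches the reference exactly
    return 10.0 * math.log10(signal_power / noise_power)

def snr_improvement_db(reference, noisy, denoised):
    """Gain in dB obtained by denoising: SNR after minus SNR before."""
    return snr_db(reference, denoised) - snr_db(reference, noisy)

# Toy signals (hypothetical, not data from the study).
reference = [1.0, -1.0, 1.0, -1.0]
noisy = [r + 0.5 for r in reference]      # noise amplitude 0.5
denoised = [r + 0.05 for r in reference]  # noise amplitude reduced tenfold

gain = snr_improvement_db(reference, noisy, denoised)
print(f"{gain:.1f} dB")  # a tenfold amplitude reduction gives 20.0 dB
```

Because the measure is logarithmic, the gain depends only on the ratio of noise power before and after denoising, not on the absolute signal level.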
