Highly Accurate Disease Diagnosis and Highly Reproducible Biomarker Identification with PathFormer
Zehao Dong, Qihang Zhao, Philip R. O. Payne, Michael A Province, Carlos Cruchaga, Muhan Zhang, Tianyu Zhao, Yixin Chen, Fuhai Li
TL;DR
PathFormer addresses two linked problems in omics analysis, accurate disease diagnosis and reproducible biomarker identification, by integrating signaling-network structure, prior gene-disease knowledge, and expression data. It introduces a Transformer-based PathFormer encoder with a pathway-aware attention mechanism (PAM) and a Knowledge-guided Disease-specific Sortpool (KD-Sortpool) to jointly optimize prediction and disease-specific biomarker selection. The approach improves diagnostic accuracy by about $30\%$ over benchmark GNN baselines and produces highly reproducible biomarker rankings across independent datasets, including the AD Mayo and RosMap datasets and a cancer cohort, while providing interpretable attention-driven insights. The result is a scalable, interpretable framework for omics analysis and biomarker discovery that can be applied beyond the studied diseases.
Abstract
Biomarker identification, for example via fold-change and regression analysis, is critical for precise disease diagnosis and for understanding disease pathogenesis in omics data analysis. Graph neural networks (GNNs) have been the dominant deep learning model for analyzing graph-structured data. However, we found two major limitations of existing GNNs in omics data analysis: limited prediction (diagnosis) accuracy and limited reproducibility of biomarker identification across multiple datasets. The root of both challenges is the unique graph structure of biological signaling pathways, which consist of a large number of targets connected by dense and complex signaling interactions. To resolve these two challenges, we present a novel GNN model architecture, named PathFormer, which systematically integrates the signaling network, prior knowledge, and omics data to rank biomarkers and predict disease diagnosis. In the comparison results, PathFormer significantly outperformed existing GNN models in both prediction accuracy (about a 30% accuracy improvement in disease diagnosis compared with existing GNN models) and reproducibility of biomarker ranking across different datasets. The improvement was confirmed on two independent Alzheimer's disease (AD) transcriptomic datasets and a cancer transcriptomic dataset. The PathFormer model can be directly applied to other omics data analysis studies.
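
One simple way to make "reproducibility of biomarker ranking across different datasets" concrete is to compare the top-ranked genes obtained from two cohorts. The sketch below is only an illustration under stated assumptions: the Jaccard overlap of top-k sets is a generic choice and not necessarily the measure used for PathFormer, the helper names `top_k` and `jaccard` are hypothetical, and the gene identifiers and scores are made-up toy values rather than results.

```python
def top_k(scores, k):
    """Return the set of the k highest-scoring gene identifiers."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard overlap |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy importance scores for the same five genes in two independent cohorts.
# These values are invented for illustration only.
cohort_a = {"G1": 0.91, "G2": 0.85, "G3": 0.40, "G4": 0.77, "G5": 0.30}
cohort_b = {"G1": 0.88, "G2": 0.35, "G3": 0.80, "G4": 0.79, "G5": 0.20}

# Overlap of the top-3 biomarker sets; values near 1 mean the rankings agree.
overlap = jaccard(top_k(cohort_a, 3), top_k(cohort_b, 3))
print(f"top-3 Jaccard overlap: {overlap:.2f}")
```

Here the two top-3 sets share two of four distinct genes, so the overlap is 0.5. A rank-aware statistic could be substituted if the order within the top-k list matters.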
