Highly Accurate Disease Diagnosis and Highly Reproducible Biomarker Identification with PathFormer

Zehao Dong; Qihang Zhao; Philip R. O. Payne; Michael A Province; Carlos Cruchaga; Muhan Zhang; Tianyu Zhao; Yixin Chen; Fuhai Li

TL;DR

PathFormer tackles two challenges in omics data analysis, accurate disease diagnosis and reproducible biomarker identification, by integrating signaling-network structure, prior gene-disease knowledge, and expression data. It introduces a Transformer-based PathFormer encoder with a Pathway-enhanced Attention Mechanism (PAM) and a Knowledge-guided Disease-specific Sortpool (KD-Sortpool) layer to jointly optimize prediction and disease-specific biomarker selection. The approach reports about a $30\%$ improvement in diagnostic accuracy over existing GNN baselines and high reproducibility of biomarker rankings across independent datasets, including the AD Mayo and RosMap cohorts and a cancer cohort, while providing interpretable attention-driven insights. These advances offer a scalable, interpretable framework for omics analyses and biomarker discovery beyond the studied diseases.

Abstract

Biomarker identification is critical for precise disease diagnosis and for understanding disease pathogenesis in omics data analysis, and is commonly performed with methods such as fold-change and regression analysis. Graph neural networks (GNNs) have been the dominant deep learning model for analyzing graph-structured data. However, we found two major limitations of existing GNNs in omics data analysis: limited prediction (diagnosis) accuracy and limited reproducibility of biomarker identification across multiple datasets. The root of these challenges is the unique graph structure of biological signaling pathways, which consist of a large number of targets and dense, complex signaling interactions among these targets. To resolve these two challenges, we present a novel GNN model architecture, named PathFormer, which systematically integrates the signaling network, prior knowledge and omics data to rank biomarkers and predict disease diagnosis. In the comparison results, PathFormer significantly outperformed existing GNN models in both prediction accuracy (about 30% accuracy improvement in disease diagnosis compared with existing GNN models) and reproducibility of biomarker ranking across different datasets. The improvement was confirmed using two independent Alzheimer's Disease (AD) transcriptomic datasets and a cancer transcriptomic dataset. The PathFormer model can be directly applied to other omics data analysis studies.
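One way to make "reproducibility of biomarker ranking across datasets" concrete is to compare the top-K gene sets selected on two cohorts. The sketch below uses the Jaccard index for that comparison. This is an illustrative assumption, not necessarily the metric used in the paper, and the scores and the `GENE_X` placeholder are made-up toy values; only the gene symbols APP, PSEN1, CLU and TPP1 appear in the source.

```python
def top_k_genes(scores, k):
    """Return the set of the k gene names with the highest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical importance scores from two independent cohorts (toy numbers).
cohort_a = {"APP": 0.91, "PSEN1": 0.85, "CLU": 0.80, "TPP1": 0.42, "GENE_X": 0.10}
cohort_b = {"APP": 0.88, "PSEN1": 0.79, "TPP1": 0.75, "CLU": 0.30, "GENE_X": 0.05}

top_a = top_k_genes(cohort_a, 3)  # {"APP", "PSEN1", "CLU"}
top_b = top_k_genes(cohort_b, 3)  # {"APP", "PSEN1", "TPP1"}
overlap = jaccard(top_a, top_b)   # 2 shared genes out of 4 distinct genes
```

A value near 1 would indicate that the two cohorts yield nearly the same top-K subset; a value near 0 would indicate little agreement.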

Paper Structure (31 sections, 1 theorem, 14 equations, 6 figures, 1 table)

Introduction
Methodology
Overview of the PathFormer Model
Knowledge-guided Disease-specific Sortpool
PathFormer Encoder Layer
Readout Mechanism
Loss Function
Interpretation from PathFormer
Experiments
Datasets and Metrics
Experiment Setup
Highly accurate prediction capability
Highly reproducible biomarker detection
Discussion
Signaling networks.
...and 16 more sections

Key Result

Theorem 4.2

The optimal solution $\hat{X}^{*}$ to the optimization formulation addresses the challenge posed by the absence of the low-pass nature. Let $M$ denote the mask matrix such that $M_{i,j}=1$ if $j \in \mathcal{N}(i)$ and $M_{i,j}=0$ otherwise. Then $MPX$ is the first-order approximation of $\hat{X}^{*}$.
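The mask matrix in the theorem can be illustrated on a toy graph. The sketch below builds $M$ from neighbor lists, assuming that $\mathcal{N}(i)$ excludes node $i$ itself (the excerpt does not say). The helper name `neighbor_mask` and the 4-gene path graph are hypothetical. Since $P$ and $X$ are not defined in this excerpt, the sketch stops at $M$ and does not attempt to form $MPX$.

```python
import numpy as np


def neighbor_mask(neighbors, n):
    """Build the n x n matrix M with M[i, j] = 1 if j is in N(i), else 0."""
    M = np.zeros((n, n), dtype=int)
    for i, nbrs in neighbors.items():
        for j in nbrs:
            M[i, j] = 1
    return M


# Toy 4-gene network shaped as a path: 0 - 1 - 2 - 3 (illustrative only).
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
M = neighbor_mask(neighbors, 4)
```

For an undirected network such as this one, $M$ is symmetric and coincides with the adjacency matrix without self-loops.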

Figures (6)

Figure 1: Overview of the proposed framework for gene signaling network analysis with GNNs. Gene interactions and gene expression values are obtained from omics data to construct gene networks (graphs). GNNs are then used to perform the prediction task accurately and efficiently while detecting a robust disease-specific gene subset, which helps explain the relation between hub genes and disease phenotypes.
Figure 2: Architecture overview. a. The proposed PathFormer encoder layer, which consists of a Pathway-enhanced Attention Mechanism (PAM) followed by a feed-forward network (FFN). Compared to a standard attention mechanism, PAM uses the proposed SNPMF (Signaling Network Pathway Modeling Framework) to generate vector embeddings of the pathways around each gene, which are then concatenated with gene features to compute the key and query matrices. b. The overall architecture of the PathFormer model, which is composed of a Knowledge-guided Disease-specific Sortpool (KD-Sortpool) layer and a stack of PathFormer encoder layers. It takes the gene networks of patients as input and outputs disease/phenotype predictions as well as a gene subset for biological interpretation.
Figure 3: a Gene networks have significantly larger graph size and cardinality than popular benchmark graphs, which causes a severe over-smoothing problem in graph machine learning. b Popular graphs are usually treated as signals that consist of a low-frequency true feature and high-frequency noise. Hence, the low-pass nature indicates that graph neural networks can be designed to filter out the high-frequency component to achieve good performance. However, gene networks do not have the low-pass property. c PathFormer addresses the problems in a and b, thus significantly improving the prediction results over existing state-of-the-art (SOTA) deep learning models.
Figure 4: a PathFormer can control the size of the detected gene subset through K in the KD-Sortpool layer; the detected gene subset expands as K increases. Each gene keeps the same position across the 9 panels. b PathFormer provides more accurate predictions when K is increased to keep more genes. c PathFormer detects similar gene subsets for gene-network datasets of the same disease/phenotype, and different gene subsets for datasets of different diseases/phenotypes.
Figure 5: a The KD-Sortpool layer in PathFormer selects 100 core genes as a gene subset to explain Alzheimer’s disease. b PathFormer can compute the relation strength (attention) between selected genes. Some genes (e.g. TPP1, PSEN1, CLU, APP) receive significantly larger attention from other genes; these genes usually have a large GDA score and are closely associated with Alzheimer’s disease. c Enrichment analysis on the detected core gene subset finds significant pathways associated with Alzheimer’s disease. d GO term analysis on the detected core gene subset finds significant biological processes associated with Alzheimer’s disease.
...and 1 more figures
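
The Figure 2 caption says that PAM concatenates pathway embeddings with gene features before computing the key and query matrices. A minimal NumPy sketch of that idea follows. Everything beyond that one statement is an assumption made for illustration: the single-head scaled dot-product form, the values computed from gene features alone, the random weight matrices, and all names (`pathway_attention`, `X`, `Z`, `Wq`, `Wk`, `Wv`). The pathway embeddings `Z` stand in for the output of SNPMF, which is not reproduced here.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def pathway_attention(X, Z, Wq, Wk, Wv):
    """Single-head attention sketch.

    X: (n, d) gene features; Z: (n, p) pathway embeddings (placeholder for SNPMF).
    Queries and keys use the concatenation [X, Z]; values use X only (assumption).
    """
    H = np.concatenate([X, Z], axis=1)           # (n, d + p)
    Q = H @ Wq                                   # (n, h)
    K = H @ Wk                                   # (n, h)
    V = X @ Wv                                   # (n, h)
    A = softmax(Q @ K.T / np.sqrt(Q.shape[1]))   # (n, n) gene-to-gene attention
    return A @ V, A


rng = np.random.default_rng(0)
n, d, p, h = 5, 4, 3, 8                          # toy sizes, chosen arbitrarily
X = rng.normal(size=(n, d))
Z = rng.normal(size=(n, p))
Wq = rng.normal(size=(d + p, h))
Wk = rng.normal(size=(d + p, h))
Wv = rng.normal(size=(d, h))
out, A = pathway_attention(X, Z, Wq, Wk, Wv)
```

Each row of `A` sums to 1 and can be read as how strongly one gene attends to every other gene, which is the kind of gene-to-gene relation strength that the Figure 5 caption describes.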

Theorems & Definitions (2)

Definition 4.1
Theorem 4.2
