Genotype-Phenotype Integration through Machine Learning and Personalized Gene Regulatory Networks for Cancer Metastasis Prediction

Jiwei Fu, Chunyu Yang

TL;DR

Metastasis prediction remains challenging across cancer types and resource settings. The authors benchmark traditional ML models on CCLE expression data and, in parallel, build personalized gene regulatory networks with PANDA and LIONESS, which are fed into a Graph Attention Network v2 (GATv2) to capture patient-specific regulatory patterns. XGBoost achieved the strongest performance (AUROC ≈ 0.705), while the GNN reached AUROC ≈ 0.642. The two approaches are presented as complementary, and the gap between them points to a limited topology signal in this dataset. The framework demonstrates feasibility for low-cost pan-cancer screening and provides a dual population- and patient-level approach to precision oncology that can guide resource allocation and future multi-omics integration.

Abstract

Metastasis is the leading cause of cancer-related mortality, yet most predictive models rely on shallow architectures and neglect patient-specific regulatory mechanisms. Here, we integrate classical machine learning and deep learning to predict metastatic potential across multiple cancer types. Gene expression profiles from the Cancer Cell Line Encyclopedia were combined with a transcription factor-target prior from DoRothEA, focusing on nine metastasis-associated regulators. After selecting differential genes using the Kruskal-Wallis test, ElasticNet, Random Forest, and XGBoost models were trained for benchmarking. Personalized gene regulatory networks were then constructed using PANDA and LIONESS and analyzed through a graph attention neural network (GATv2) to learn topological and expression-based representations. While XGBoost achieved the highest AUROC (0.7051), the GNN captured non-linear regulatory dependencies at the patient level. These results demonstrate that combining traditional machine learning with graph-based deep learning enables a scalable and interpretable framework for metastasis risk prediction in precision oncology.
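As a rough illustration of the screening and evaluation steps named in the abstract, the sketch below computes a two-group Kruskal-Wallis statistic for a single gene and a rank-based AUROC for a set of predictions. It is written in plain Python for readability and is not the authors' code: the function names (`average_ranks`, `kruskal_two_groups`, `auroc`) and the toy values are assumptions made for this example. The p-value uses the chi-square approximation with one degree of freedom, which applies only to the two-group case shown here.

```python
from math import erfc, sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_two_groups(a, b):
    """Tie-corrected Kruskal-Wallis H and approximate p-value for two groups."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sum_a = sum(ranks[:len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = 12 / (n * (n + 1)) * (rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b))
    h -= 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over tied-value counts t.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        return 0.0, 1.0  # every value identical: no evidence of a difference
    h = max(h / correction, 0.0)
    # Chi-square survival function with 1 degree of freedom.
    return h, erfc(sqrt(h / 2))


def auroc(labels, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical expression values of one gene in primary vs metastatic samples.
    h_stat, p_value = kruskal_two_groups([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
    print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
    # Hypothetical labels (1 = metastatic) and model scores.
    print("AUROC =", auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

In a genome-wide screen the same test would be repeated per gene, so a multiple-testing correction would normally be applied before choosing features. The abstract does not state which correction, if any, was used.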

Paper Structure

This paper contains 37 sections, 1 equation, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Sample count by class and class distribution. The bar plot (top) and pie chart (bottom) show that the dataset contains 926 primary samples (63.7%), 515 metastatic samples (35.4%), 1 recurrent sample (0.1%), and 11 samples with unknown classification (0.8%).
  • Figure 2: Workflow of the proposed analysis framework. Gene expression profiles were analyzed using two complementary approaches: (1) expression-only machine learning models (XGBoost, ElasticNet, Random Forest) for metastasis classification, serving both as benchmarks for GNN performance and as a test of feasibility in low-resource settings; (2) graph neural networks trained on personalized gene regulatory networks, which are generated by integrating TF-target data with expression data using the PANDA and LIONESS algorithms.
  • Figure 3: Overview of the LIONESS framework. Figure adapted from Kuijjer et al. (2019).
  • Figure 4: Results section overview.
  • Figure 5: Volcano plot of genome-wide differential expression. The plot compares metastatic and primary samples. Most genes reached statistical significance ($p < 0.05$), with subsets showing consistent upregulation or downregulation in metastatic samples, suggesting the presence of systematic expression differences.
  • ...and 9 more figures
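
The LIONESS step referenced in Figures 2 and 3 can be sketched in a few lines. LIONESS estimates the network of sample $q$ by comparing an aggregate network built from all $N$ samples, $e^{(\alpha)}$, with one built without that sample, $e^{(\alpha - q)}$: $e^{(q)} = N\,(e^{(\alpha)} - e^{(\alpha - q)}) + e^{(\alpha - q)}$. The example below is a minimal sketch under a stated assumption: it uses Pearson correlation as a stand-in aggregate network, whereas the paper builds its aggregate networks with PANDA. The function names and toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def aggregate_network(expr):
    """Stand-in aggregate network: one correlation weight per gene pair.

    expr maps gene name -> list of expression values, one per sample.
    (Assumption: the paper uses PANDA here, not Pearson correlation.)
    """
    genes = sorted(expr)
    return {
        (g1, g2): pearson(expr[g1], expr[g2])
        for i, g1 in enumerate(genes)
        for g2 in genes[i + 1:]
    }


def lioness_network(expr, q):
    """Sample-specific edge weights for sample index q via the LIONESS equation."""
    n_samples = len(next(iter(expr.values())))
    full = aggregate_network(expr)
    # Rebuild the aggregate network with sample q left out.
    without_q = aggregate_network(
        {gene: vals[:q] + vals[q + 1:] for gene, vals in expr.items()}
    )
    return {
        edge: n_samples * (full[edge] - without_q[edge]) + without_q[edge]
        for edge in full
    }


if __name__ == "__main__":
    # Hypothetical expression of a regulator ("TF") and a target ("G") in 4 samples.
    toy = {"TF": [1.0, 2.0, 3.0, 4.0], "G": [1.0, 2.0, 3.0, 0.0]}
    for sample in range(4):
        print(sample, lioness_network(toy, sample))
```

Note that sample-specific weights are not bounded like correlations: a sample that strongly shifts the aggregate estimate receives a large positive or negative edge weight, which is what makes the per-sample networks informative as GNN inputs.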