Evaluation of network-guided random forest for disease gene discovery
Jianchang Hu, Silke Szymczak
TL;DR
This study investigates whether incorporating external gene-network information into random forest (RF) training, via random-walk-based sampling probabilities, improves disease gene discovery. The network-guided RF runs a directed random walk on the gene network to obtain an equilibrium importance distribution pi^*, which biases predictor selection during tree construction; the Network-P and Network-Q variants additionally integrate marginal association signals. Across extensive simulations, network guidance seldom improves overall disease prediction. It does improve recovery of disease genes when they form modules, but it risks spurious, hub-driven selections when no real association exists. Validation on TCGA breast cancer datasets for progesterone receptor (PR) status suggests that network-guided RF can reveal genes from PGR-related pathways and strengthen module connectivity, though performance depends on the network prior and on the threshold used for gene selection. Overall, the work highlights the potential of network-informed RF for disease-module discovery, while calling for automated, robust variable-selection procedures and broader comparisons with alternative network integration strategies.
Abstract
Gene network information is believed to be beneficial for disease module and pathway identification, but has not been explicitly utilized in the standard random forest (RF) algorithm for gene expression data analysis. We investigate the performance of a network-guided RF in which the network information is summarized into a sampling probability over predictor variables, which is then used in the construction of the RF. Our results suggest that network-guided RF does not provide better disease prediction than the standard RF. In terms of disease gene discovery, if disease genes form module(s), network-guided RF identifies them more accurately. In addition, when disease status is independent of the genes in the given network, using network information can lead to spurious gene selection, especially of hub genes. Our empirical analysis of two balanced breast cancer datasets (microarray and RNA-Seq) from The Cancer Genome Atlas (TCGA) for classification of progesterone receptor (PR) status also demonstrates that network-guided RF can identify genes from PGR-related pathways, which leads to a better connected module of identified genes.
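To make the idea of a network-derived sampling probability concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a random walk with restart on a small, undirected toy network, a uniform restart vector, and a restart probability of 0.3; the function names `random_walk_stationary` and `sample_candidates` are hypothetical. The resulting probability vector is then used to draw the candidate predictors considered at a single tree split, in place of uniform sampling.

```python
import numpy as np


def random_walk_stationary(adj, restart=0.3, tol=1e-10, max_iter=1000):
    """Approximate the equilibrium distribution of a random walk with restart.

    adj[i, j] > 0 means the walker can move from gene j to gene i.
    The restart probability and uniform restart vector are assumptions
    made for this illustration.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0      # genes without edges keep an all-zero column
    W = adj / col_sums                 # column-normalised transition matrix
    p0 = np.full(n, 1.0 / n)           # uniform restart distribution
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (W @ p) + restart * p0
        converged = np.abs(p_next - p).sum() < tol
        p = p_next
        if converged:
            break
    return p / p.sum()                 # renormalise to a probability vector


def sample_candidates(pi, mtry, rng):
    """Draw mtry distinct predictor indices with probabilities pi."""
    return rng.choice(len(pi), size=mtry, replace=False, p=pi)


# Toy network: gene 0 is a hub linked to genes 1-3; genes 4 and 5 form a pair.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (0, 3), (4, 5)]:
    adj[i, j] = adj[j, i] = 1.0

pi = random_walk_stationary(adj)
rng = np.random.default_rng(0)
candidates = sample_candidates(pi, mtry=3, rng=rng)
print(np.round(pi, 3), candidates)
```

In this toy network the hub gene receives the largest sampling probability purely because of its connectivity, which illustrates how hub genes could be favoured even when they carry no association with disease status.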
