Evaluation of network-guided random forest for disease gene discovery

Jianchang Hu; Silke Szymczak

TL;DR

This study investigates incorporating external gene-network information into random forest (RF) training through network-derived sampling probabilities to improve disease gene discovery. The network-guided RF runs a directed random walk on the gene network to obtain an equilibrium distribution $\pi^*$, which biases predictor selection during tree construction; variants such as Network-P and Network-Q also integrate marginal association signals. Across extensive simulations, network guidance seldom improves overall disease prediction, but it improves recovery of disease genes when they form modules. When no real association exists, it risks spurious selections driven by hub genes. Validation on TCGA breast cancer datasets for progesterone receptor (PR) status suggests that network-guided RF can reveal genes from PR-related pathways and strengthen module connectivity, though performance depends on the network priors and on the threshold used for gene selection. Overall, the work highlights the potential of network-informed RF for disease-module discovery, while calling for automated, robust variable-selection procedures and broader comparisons with alternative network integration strategies.
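The mechanism summarized in the TL;DR can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a random walk with restart on a toy directed network, and the restart probability, the uniform restart vector `p0`, and the function names are all hypothetical choices made for illustration. The resulting equilibrium distribution is then used as the sampling probability for candidate predictors at a tree node.

```python
import numpy as np

def stationary_distribution(adj, restart=0.3, p0=None, tol=1e-10, max_iter=1000):
    """Random walk with restart; adj[i, j] = 1 means a directed edge i -> j.

    Assumption: the restart form and restart=0.3 are illustrative only.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    # Restart vector; a non-uniform p0 could carry marginal association signals.
    p0 = np.full(n, 1.0 / n) if p0 is None else np.asarray(p0, dtype=float) / np.sum(p0)
    out_deg = adj.sum(axis=1, keepdims=True)
    # Row-normalise the adjacency matrix; nodes without outgoing edges jump to p0.
    trans = np.where(out_deg > 0, adj / np.where(out_deg > 0, out_deg, 1.0), p0)
    pi = p0.copy()
    for _ in range(max_iter):
        new_pi = (1 - restart) * pi @ trans + restart * p0
        if np.abs(new_pi - pi).sum() < tol:
            return new_pi
        pi = new_pi
    return pi

def sample_candidates(pi, mtry, rng):
    """Draw mtry distinct candidate predictors with probabilities pi."""
    return rng.choice(len(pi), size=mtry, replace=False, p=pi)

# Toy network: genes 1..4 all point to gene 0 (a hub), and gene 0 points to gene 1.
adj = np.zeros((5, 5))
adj[1:, 0] = 1
adj[0, 1] = 1

pi = stationary_distribution(adj)
rng = np.random.default_rng(0)
candidates = sample_candidates(pi, mtry=3, rng=rng)
print(np.round(pi, 3), candidates)
```

In this toy network the hub (gene 0) receives the largest sampling probability purely because of its position in the graph, which is consistent with the caution above that hub genes may be selected spuriously when there is no real association. In practice the weights would be handed to an RF implementation that supports weighted predictor sampling; which implementation the authors used is not stated here.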

Abstract

Gene network information is believed to be beneficial for disease module and pathway identification, but has not been explicitly utilized in the standard random forest (RF) algorithm for gene expression data analysis. We investigate the performance of a network-guided RF in which the network information is summarized into a sampling probability for the predictor variables, which is then used in the construction of the RF. Our results suggest that network-guided RF does not provide better disease prediction than the standard RF. In terms of disease gene discovery, if disease genes form module(s), network-guided RF identifies them more accurately. In addition, when disease status is independent of the genes in the given network, using network information can produce spurious gene selections, especially for hub genes. Our empirical analysis of two balanced microarray and RNA-Seq breast cancer datasets from The Cancer Genome Atlas (TCGA) for classification of progesterone receptor (PR) status also demonstrates that network-guided RF can identify genes from PGR-related pathways, which leads to a better connected module of identified genes.

Paper Structure (17 sections, 3 equations, 7 figures, 2 tables)

Introduction
Materials and methods
Network-guided RF
Selection of important genes
Simulation study
Aim
Data generation
Estimand
Methods to be evaluated
Performance measure
Experimental datasets
Results
Simulation results
Prediction accuracy
Disease gene identification
...and 2 more sections

Figures (7)

Figure 1: Illustration of disease modules with and without a main disease gene. The size of each bubble reflects the effect size of the gene. When there is no main disease gene (left), all disease genes have the same effect size. When there is a main disease gene (right), the effect size of each disease gene is proportional to its closeness to the main disease gene within the module; the closer to the main disease gene, the larger the effect size.
Figure 2: Prediction performance of all methods in all simulation scenarios. Performance is measured by the average misclassification rate on the test set over 100 repetitions. In each scenario, three average effect sizes are considered to represent weak, medium and strong signals. The upper panel shows the results for a total of $p=1000$ genes, which equals the number of training samples, and the lower panel shows the results for a total of $p=3000$ genes, representing the high-dimensional setting.
Figure 3: Number of genes consistently selected as important by each method in the null case. Consistency is measured by counting, for each gene, the number of the 100 repetitions in which it is selected as important. The plot shows the number of falsely selected genes at several consistency thresholds.
Figure 4: Sensitivity of all methods for selecting disease genes in all simulation scenarios. Performance is measured by the average proportion of disease genes selected as important by each method over 100 repetitions. In each scenario, three average effect sizes are considered to represent weak, medium and strong signals. The upper panel shows the results for a total of $p=1000$ genes, which equals the number of training samples, and the lower panel shows the results for a total of $p=3000$ genes, representing the high-dimensional setting.
Figure 5: Common top genes selected by each method on both TCGA breast cancer microarray and RNA-Seq datasets.
...and 2 more figures
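
The selection measures described in the captions of Figures 3 and 4 can be sketched as follows. This is only an illustration under stated assumptions: the selection matrix is a made-up toy with 4 repetitions and 5 genes (the study uses 100 repetitions), and the function names and the thresholds are hypothetical.

```python
import numpy as np

def selection_counts(selected):
    """Number of repetitions in which each gene is selected as important.

    selected: boolean array of shape (repetitions, genes).
    """
    return np.asarray(selected, dtype=bool).sum(axis=0)

def n_genes_at_threshold(counts, thresholds):
    """Number of genes selected in at least t repetitions, for each threshold t."""
    return {t: int((counts >= t).sum()) for t in thresholds}

def sensitivity(selected, disease_idx):
    """Average proportion of disease genes selected, taken over repetitions."""
    sel = np.asarray(selected, dtype=bool)
    return float(sel[:, disease_idx].mean())

# Toy selection results (rows: repetitions, columns: genes); values are invented.
selected = np.array([
    [1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 1],
], dtype=bool)

counts = selection_counts(selected)
print(counts.tolist())                          # [4, 1, 3, 0, 1]
print(n_genes_at_threshold(counts, [2, 3, 4]))  # {2: 2, 3: 2, 4: 1}
print(sensitivity(selected, [0, 2]))            # 0.875
```

In the null case every selected gene is a false selection, so the threshold counts correspond to the quantity plotted in Figure 3; in scenarios with known disease genes, `sensitivity` corresponds to the measure in Figure 4.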
