Pruning neural network models for gene regulatory dynamics using data and domain knowledge

Intekhab Hossain, Jonas Fischer, Rebekka Burkholz, John Quackenbush

TL;DR

The paper proposes DASH, a generalizable framework that guides network pruning with domain-specific structural information during model fitting, yielding sparser, more interpretable models that are more robust to noise.

Abstract

The practical utility of machine learning models in the sciences often hinges on their interpretability. It is common to assess a model's merit for scientific discovery, and thus for novel insights, by how well it aligns with already available domain knowledge, a dimension that is currently largely disregarded in the comparison of neural network models. While pruning can simplify deep neural network architectures and excels at identifying sparse models, we show in the context of gene regulatory network inference that state-of-the-art techniques struggle with biologically meaningful structure learning. To address this issue, we propose DASH, a generalizable framework that guides network pruning by using domain-specific structural information in model fitting and leads to sparser, more interpretable models that are more robust to noise. Using both synthetic data with ground truth information and real-world gene expression data, we show that DASH, using knowledge about gene interaction partners within the putative regulatory network, outperforms general pruning methods by a large margin and yields deeper insights into the biological systems being studied.

Paper Structure (54 sections, 18 equations, 7 figures, 10 tables, 1 algorithm)

Figures (7)

  • Figure 1: DASH. A neural network (NN), here a neural ODE for gene regulatory dynamics, is traditionally sparsified in a data-centric way (top): pruning is based on data alone, so the pruning score $\Omega$ is a function of the learned weights $W$. Models sparsified this way often do not learn plausible relationships in the data domain. We propose DASH (bottom), which additionally incorporates domain knowledge $P$ into the pruning score $\Omega$, yielding sparse networks that give meaningful and useful insights into the domain.
  • Figure 2: Results on simulated data. We visualize the performance of pruning strategies in comparison to the original PHOENIX (baseline) in terms of achieved sparsity (x-axis) and balanced accuracy (y-axis) of the recovered gene regulatory network against the ground truth on the SIM350 data with 5% noise. Error bars are omitted when the error is smaller than the depicted symbol. Checkmarks ($\checkmark$) indicate methods that leverage prior information. Top left is best: recovering true, inherently sparse biological relationships.
  • Figure 3: Reconstruction of ground truth relationships. Estimated effect of gene $g_j$ (x-axis) on the dynamics of gene $g_i$ (y-axis) in SIM350 for different levels of noise (rows). Ground truth is given on the left, our suggested approach and baselines (DASH, BioPrune, and PINN+MP) on the right with mean squared error between inferred regulatory relationships and ground truth in purple.
  • Figure 4: SIM690 data with 5% noise. We visualize performance of pruning strategies in comparison to original PHOENIX (baseline) in terms of achieved sparsity (x-axis) and balanced accuracy (y-axis) of the recovered gene regulatory network against the ground truth. Error bars are omitted when error is smaller than depicted symbol. Checkmarks ($\checkmark$) are used to indicate methods that leverage prior information. Ideal models are in the top left quadrant; they recover the true, inherently sparse biological relationships.
  • Figure 5: BRCA pathway analysis. We visualize the top-20 significant pathways for each method, showing the pathway z-score (x-axis), and mark results that remain significant after FWER correction (Bonferroni, p-value cutoff of $0.05$) with *.
  • ...and 2 more figures
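
The contrast drawn in the Figure 1 caption (a pruning score $\Omega$ computed from the weights $W$ alone versus one that also uses domain knowledge $P$) and the two axes of Figures 2 and 4 (sparsity and balanced accuracy) can be illustrated with a minimal sketch. This is not the DASH scoring rule, which the text above does not specify: the linear blend in `prior_informed_scores`, the weight `lam`, the `n_keep` budget, and the toy matrices are all assumptions made here for illustration only.

```python
def magnitude_scores(W):
    """Data-centric score: Omega depends on the learned weights W alone."""
    return [[abs(w) for w in row] for row in W]


def prior_informed_scores(W, P, lam=0.5):
    """Hypothetical blend of weight magnitude and prior P.

    This linear mix is an illustrative assumption, NOT the paper's formula.
    """
    return [
        [(1 - lam) * abs(w) + lam * abs(p) for w, p in zip(w_row, p_row)]
        for w_row, p_row in zip(W, P)
    ]


def prune_mask(scores, n_keep):
    """Return a 0/1 mask that keeps the n_keep highest-scoring entries."""
    n_rows, n_cols = len(scores), len(scores[0])
    ranked = sorted(
        ((scores[i][j], i, j) for i in range(n_rows) for j in range(n_cols)),
        reverse=True,
    )
    mask = [[0] * n_cols for _ in range(n_rows)]
    for _, i, j in ranked[:n_keep]:
        mask[i][j] = 1
    return mask


def sparsity(mask):
    """Fraction of entries that were pruned (set to zero)."""
    flat = [m for row in mask for m in row]
    return flat.count(0) / len(flat)


def balanced_accuracy(mask, truth):
    """Mean of true-positive and true-negative rates of kept edges vs. truth.

    Assumes truth contains at least one edge and at least one non-edge.
    """
    tp = fn = tn = fp = 0
    for m_row, t_row in zip(mask, truth):
        for m, t in zip(m_row, t_row):
            if t and m:
                tp += 1
            elif t:
                fn += 1
            elif m:
                fp += 1
            else:
                tn += 1
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


# Toy 3-gene example (made-up numbers): entry [i][j] is the effect of gene j on gene i.
W = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.05], [0.6, 0.1, 0.7]]
truth = [[1, 0, 0], [0, 1, 0], [0, 1, 1]]
P = truth  # an idealized prior that happens to match the ground truth

data_mask = prune_mask(magnitude_scores(W), n_keep=4)
prior_mask = prune_mask(prior_informed_scores(W, P), n_keep=4)

print(sparsity(data_mask), balanced_accuracy(data_mask, truth))
print(sparsity(prior_mask), balanced_accuracy(prior_mask, truth))
```

In this toy setting both masks have the same sparsity, but the data-only mask keeps a large spurious weight and drops a small true one, while the prior-informed mask recovers the true edges. Real priors are noisy and incomplete, so the gap on actual data depends on prior quality and on how the prior enters the score.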