Product Manifold Representations for Learning on Biological Pathways

Daniel McNeela, Frederic Sala, Anthony Gitter

TL;DR

Biological pathway graphs have complex topologies that Euclidean embeddings represent poorly. The authors embed these graphs in mixed-curvature product manifolds and train a Product GCN that combines spherical, hyperbolic, and Euclidean components, with distances decomposing across the components. They report substantial reductions in distortion and improved in-distribution edge prediction. On out-of-distribution PPI edges, however, the mixed-curvature models underperform existing baselines, which suggests overfitting to the training topology. The work provides open-source code and highlights both the promise and the limitations of non-Euclidean representations for pathway analysis and predictive modeling.

Abstract

Machine learning models that embed graphs in non-Euclidean spaces have shown substantial benefits in a variety of contexts, but their application has not been studied extensively in the biological domain, particularly with respect to biological pathway graphs. Such graphs exhibit a variety of complex network structures, presenting challenges to existing embedding approaches. Learning high-quality embeddings for biological pathway graphs is important for researchers looking to understand the underpinnings of disease and train high-quality predictive models on these networks. In this work, we investigate the effects of embedding pathway graphs in non-Euclidean mixed-curvature spaces and compare against traditional Euclidean graph representation learning models. We then train a supervised model using the learned node embeddings to predict missing protein-protein interactions in pathway graphs. We find large reductions in distortion and boosts in in-distribution edge prediction performance as a result of using mixed-curvature embeddings and their corresponding graph neural network models. However, we find that mixed-curvature representations underperform existing baselines on out-of-distribution edge prediction, suggesting that these representations may overfit to the training graph topology. We provide our Mixed-Curvature Product Graph Convolutional Network code at https://github.com/mcneela/Mixed-Curvature-GCN and our pathway analysis code at https://github.com/mcneela/Mixed-Curvature-Pathways.
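
To make the idea of distances decomposing across components concrete, here is a minimal sketch, not the authors' implementation. It assumes unit-curvature components (a unit sphere, the hyperboloid model of hyperbolic space, and a Euclidean factor) and combines the per-component geodesic distances by the usual product-metric rule of summing their squares. The function names and the average-distortion formula (mean of |d_embedding / d_graph - 1| over node pairs) are illustrative assumptions; the paper may define its metric differently.

```python
import math

def spherical_dist(x, y):
    """Geodesic distance on the unit sphere (curvature +1 assumed)."""
    dot = sum(a * b for a, b in zip(x, y))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding error

def hyperbolic_dist(x, y):
    """Geodesic distance in the hyperboloid model (curvature -1 assumed)."""
    lorentz = -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))
    return math.acosh(max(1.0, -lorentz))  # clamp against rounding error

def euclidean_dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def product_dist(x, y):
    """x and y are (spherical, hyperbolic, Euclidean) coordinate triples.

    The squared product distance is the sum of squared component distances.
    """
    parts = (
        spherical_dist(x[0], y[0]),
        hyperbolic_dist(x[1], y[1]),
        euclidean_dist(x[2], y[2]),
    )
    return math.sqrt(sum(d * d for d in parts))

def avg_distortion(embedding_dists, graph_dists):
    """Assumed definition: mean relative error between embedded and graph distances."""
    errors = [abs(e / g - 1.0) for e, g in zip(embedding_dists, graph_dists)]
    return sum(errors) / len(errors)

def hyperboloid_point(t):
    """Hypothetical helper: point at hyperbolic distance t from the origin (1, 0)."""
    return (math.cosh(t), math.sinh(t))

# Two toy points whose component distances are pi/2, 2, and 5 respectively.
p = ((1.0, 0.0), hyperboloid_point(0.0), (0.0, 0.0))
q = ((0.0, 1.0), hyperboloid_point(2.0), (3.0, 4.0))
d_pq = product_dist(p, q)
```

In this sketch each factor contributes independently, so a graph region that is tree-like can be absorbed by the hyperbolic factor while cycle-like structure is absorbed by the spherical one.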

Paper Structure

This paper contains 30 sections, 10 equations, 19 figures, 3 tables.

Figures (19)

  • Figure 1: Histograms of node and edge distributions for the pathway databases studied. A few outliers were excluded.
  • Figure 2: Best overall mixed-curvature versus best Euclidean distortions across all pathway datasets (a). See Figure \ref{fig:distortion-scatter-apdx} for individual pathway datasets. Distortions across all datasets for the graph Laplacian, node2vec, and mixed-curvature embedding methods (b).
  • Figure 3: Edge prediction performance on all five pathway datasets for all models as given by four metrics: Validation Set AP, Validation Set AUROC, Test Set AP, and Test Set AUROC. Test set edges come from STRING.
  • Figure 4: Scatterplots of distortion in the Euclidean embedding versus distortion in the mixed-curvature embedding for pathway datasets. Points are colored by local density, with yellow indicating the highest density.
  • Figure 5: Comparison of Euclidean GCN initialized with pretrained Euclidean embeddings and Product GCN performance on in-distribution validation set and out-of-distribution test set. Each density plot shows one of either AP or AUROC metrics taken across all graphs in the PathBank dataset.
  • ...and 14 more figures

Theorems & Definitions (1)

  • Definition 3.1