Graph Representation Learning in Biomedicine

Michelle M. Li, Kexin Huang, Marinka Zitnik

TL;DR

This review examines how graph representation learning embeds heterogeneous biomedical graphs into compact vectors to support prediction, discovery, and interpretation across molecular, genomic, therapeutic, and healthcare domains. It surveys three core families (shallow embedding methods, graph neural networks, and generative graph models) and connects them to long-standing systems biology principles to explain their successes and limitations. It highlights applications including the prediction of molecular interactions, disease mechanisms, drug actions, and patient-level outcomes, and discusses challenges in scalability, interpretability, and data integration. The work provides a unified framework and roadmap for future graph-based biomedical research, with emphasis on multi-scale knowledge graphs, spatial and single-cell data, and responsible deployment in clinical settings.

Abstract

Biomedical networks (or graphs) are universal descriptors for systems of interacting elements, from molecular interactions and disease co-morbidity to healthcare systems and scientific knowledge. Advances in artificial intelligence, specifically deep learning, have enabled us to model, analyze, and learn with such networked data. In this review, we put forward an observation that long-standing principles of systems biology and medicine -- while often unspoken in machine learning research -- provide the conceptual grounding for representation learning on graphs, explain its current successes and limitations, and even inform future advancements. We synthesize a spectrum of algorithmic approaches that, at their core, leverage graph topology to embed networks into compact vector spaces. We also capture the breadth of ways in which representation learning has dramatically improved the state-of-the-art in biomedical machine learning. Exemplary domains covered include identifying variants underlying complex traits, disentangling behaviors of single cells and their effects on health, assisting in diagnosis and treatment of patients, and developing safe and effective medicines.

Paper Structure

This paper contains 21 sections, 4 figures.

Figures (4)

  • Figure 1: Representation learning for networks in biology and medicine. Given a biomedical network, a representation learning method transforms the graph to extract patterns and leverage them to produce compact vector representations that can be optimized for the downstream task. The far right panel shows a local 2-hop neighborhood around node $u$, illustrating how information (e.g., neural messages) can be propagated along edges in the neighborhood, transformed, and finally aggregated at node $u$ to arrive at $u$'s embedding.
  • Figure 2: Predominant paradigms in graph representation learning. (a) Shallow network embedding methods generate a dictionary of representations $\mathbf{h}_u$ for every node $u$ that preserves the input graph structure information. This is achieved by learning a mapping function $f_z$ that maps nodes into an embedding space such that nodes with similar graph neighborhoods, as measured by a function $f_n$, are embedded closer together (Section \ref{sec:shallow}). Given the learned embeddings, an independent decoder method can optimize embeddings for downstream tasks, such as node or link property prediction. Method examples include DeepWalk \cite{perozzi2014deepwalk}, Node2vec \cite{node2vec}, LINE \cite{tang2015line}, and Metapath2vec \cite{metapath2vec}. (b) In contrast with shallow network embedding methods, graph neural networks can generate representations for any graph element by capturing both network structure and node attributes and metadata. The embeddings are generated through a series of non-linear transformations, i.e., message-passing layers ($L_k$ denotes transformations at layer $k$), that iteratively aggregate information from neighboring nodes at the target node $u$. GNN models can be optimized for performance on a variety of downstream tasks (Section \ref{sec:gnn}). Method examples include GCN \cite{gcn}, GIN \cite{gin}, GAT \cite{gat}, and JK-Net \cite{xu2018representation}. (c) Generative graph models estimate a distribution landscape $\mathbf{Z}$ to characterize a collection of distinct input graphs. They use the optimized distribution to generate novel graphs $\widehat{G}$ that are predicted to have desirable properties, e.g., a generated graph can represent the molecular graph of a drug candidate. Generative graph models use graph neural networks as encoders and produce graph representations that capture both network structure and attributes (Section \ref{sec:generative}). Method examples include GCPN \cite{gcpn}, JT-VAE \cite{jtvae}, and GraphRNN \cite{you2018graphrnn}. SI Figure 1 and SI Note 3 outline other representation learning techniques.
  • Figure 3: Overview of biomedical application areas. Networks are prevalent across biomedical areas, from the molecular level to the healthcare systems level. Protein structures and therapeutic compounds can be modeled as networks in which nodes represent atoms and edges indicate bonds between pairs of atoms. Protein interaction networks contain nodes that represent proteins and edges that indicate physical interactions (top left). Drug interaction networks consist of drug nodes connected by synergistic or antagonistic relationships (bottom left). Protein- and drug-interaction networks can be combined using an edge type that signifies a protein being a "target" of a drug (left). Disease association networks often contain disease nodes with edges representing co-morbidity (middle). Edges exist between proteins and diseases to indicate proteins (or genes) associated with a disease (top middle). Edges exist between drugs and diseases to signify drugs that are indicated for a disease (bottom middle). Patient-specific data, such as medical images (e.g., spatial networks of cells, tumors, and lymph nodes) and EHRs (e.g., networks of medical codes and concepts generated by co-occurrences in patients' records), are often integrated into a cross-domain knowledge graph of proteins, drugs, and diseases (right). With such vast and diverse biomedical networks, we can derive fundamental insights about biology and medicine while enabling personalized representations of patients for precision medicine. Note that there are many other types of edge relations; "targets," "is associated with," "is indicated for," and "has phenotype" are a few examples.
  • Figure 4: Representation learning in four areas of biology and medicine. We present a case study on (a) cell-type aware protein representation learning via multilabel node classification (details in Box \ref{sec:mol_app}), (b) disease classification using subgraphs (details in Box \ref{sec:genome_app}), (c) cell-line specific prediction of interacting drug pairs via edge regression with transfer learning across cell lines (details in Box \ref{sec:drug_app}), and (d) integration of health data into knowledge graphs to predict patient diagnoses or treatments via edge regression (details in Box \ref{sec:patient_app}).
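
The neighborhood aggregation described in the captions of Figures 1 and 2(b) can be illustrated with a minimal sketch. It is not the method of any paper cited above: the mean aggregator without learned weights, the toy path graph, the one-hot input features, and all function names are assumptions chosen for readability. Models such as GCN, GIN, or GAT apply learned, non-linear transformations at each layer instead.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]        # adjacency list: node -> neighbors
Features = Dict[str, List[float]]   # node -> feature vector


def mean_vectors(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def message_passing_layer(graph: Graph, h: Features) -> Features:
    """One round: each node averages its own vector with its neighbors' vectors."""
    return {u: mean_vectors([h[u]] + [h[v] for v in graph[u]]) for u in graph}


def embed(graph: Graph, x: Features, num_layers: int = 2) -> Features:
    """Stack num_layers rounds; k rounds give each node a k-hop receptive field."""
    h = x
    for _ in range(num_layers):
        h = message_passing_layer(graph, h)
    return h


# Hypothetical toy graph: an undirected path u - a - b - c.
toy_graph: Graph = {"u": ["a"], "a": ["u", "b"], "b": ["a", "c"], "c": ["b"]}

# One-hot input features (order: u, a, b, c), so each coordinate of an
# embedding shows how much signal arrived from the corresponding node.
toy_features: Features = {
    "u": [1.0, 0.0, 0.0, 0.0],
    "a": [0.0, 1.0, 0.0, 0.0],
    "b": [0.0, 0.0, 1.0, 0.0],
    "c": [0.0, 0.0, 0.0, 1.0],
}

embeddings = embed(toy_graph, toy_features, num_layers=2)
```

With two rounds, the embedding of `u` carries a non-zero contribution from `b` (two hops away) and none from `c` (three hops away), which mirrors the 2-hop neighborhood shown in Figure 1.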