Simplicity within biological complexity

Natasa Przulj, Noel Malod-Dognin

TL;DR

It is proposed to develop a general, comprehensive embedding framework for multi-omic network data, from models to efficient and scalable software implementation, and to apply it to biomedical informatics, focusing on precision medicine and personalized drug discovery.

Abstract

Heterogeneous, interconnected, systems-level molecular data have become increasingly available and are now key in precision medicine. We need to utilize them to better stratify patients into risk groups, discover new biomarkers and targets, and repurpose known drugs and discover new ones, so as to personalize medical treatment. Existing methodologies are limited and a paradigm shift is needed to achieve quantitative and qualitative breakthroughs. In this perspective paper, we survey the literature and argue for the development of a comprehensive, general framework for embedding multi-scale molecular network data that would enable their explainable exploitation in precision medicine in linear time. Network embedding methods map nodes to points in a low-dimensional space, so that proximity in the learned space reflects the network's topology-function relationships. They have recently achieved unprecedented performance on hard problems of utilizing a few omic data types in various biomedical applications. However, research thus far has been limited to special variants of the problems and data, with performance depending on the underlying topology-function network biology hypotheses, the biomedical applications and the evaluation metrics. The availability of multi-omic data, modern graph embedding paradigms and compute power calls for the creation and training of efficient, explainable and controllable models that have no potentially dangerous, unexpected behaviour and that make a qualitative breakthrough. We propose to develop a general, comprehensive embedding framework for multi-omic network data, from models to efficient and scalable software implementation, and to apply it to biomedical informatics. It will lead to a paradigm shift in the computational and biomedical understanding of data and diseases that will open up ways of solving some of the major bottlenecks in precision medicine and other domains.

Paper Structure (19 sections, 2 equations, 4 figures)


Figures (4)

  • Figure 1: Illustration of connectedness of biomedical network data.
  • Figure 2: Illustration of the linear analogy "queen is to woman what king is to man".
  • Figure 3: Illustration of the NMTF model from zambrana2021. The viral host interactions, VHIs (represented by matrix $R_{12}$), are simultaneously decomposed with the drug-target interactions, DTIs (represented by matrix $R_{23}$). The matrix factor $G_2$ is shared across decompositions to allow learning from all input matrices. The first graph regularization penalty (illustrated by the green arrow) is added so that the human genes that interact in the molecular interaction network, MIN (represented by its Laplacian matrix, $L_2$), are assigned similar low dimensional embedding vectors in $G_2$. Similarly, the second graph regularization penalty (illustrated by the red arrow) is added so that the drugs that have similar chemical structures in the drug chemical similarity network, DCS (represented by its Laplacian matrix, $L_3$), are assigned similar low dimensional embedding vectors in $G_3$. The hyper-parameters $k_1$, $k_2$ and $k_3$ indicate the reduced dimensions of the embedding spaces of the viral proteins, human proteins and drugs, respectively. The lower dimensional matrix factors, $G_1$, $G_2$, $G_3$, $H_{12}$ and $H_{23}$ are obtained by solving the corresponding minimization problem, $J_1$.
  • Figure 4: Illustration of the NMTF model from mihajlovic2023m. The single-cell gene expression matrix, $E$ (which can be thought of as capturing the phenotype), is simultaneously decomposed with four molecular interaction networks, PPI, GI, COEX and MI, represented by their adjacency matrices, $A_1$, $A_2$, $A_3$, and $A_4$, respectively (which can be thought of as capturing the genotype, as they describe all possible interactions). The matrix factor $G_1$ is shared across decompositions to allow learning from all input matrices. The hyper-parameters $k_1$ and $k_2$ indicate the reduced dimensions of the embedding spaces of human genes and single cells, respectively. The lower dimensional matrix factors, $G_1$, $G_2$, $S_1$, $S_2$, $S_3$, $S_4$ and $S_5$ are obtained by solving the corresponding minimization problem, $J_2$.
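
The Figure 3 caption describes the objective $J_1$ in words only, so the following is a minimal sketch of how such an objective might be evaluated, not the authors' implementation. It assumes the common graph-regularized NMTF form, $\|R_{12} - G_1 H_{12} G_2^T\|_F^2 + \|R_{23} - G_2 H_{23} G_3^T\|_F^2 + \mathrm{tr}(G_2^T L_2 G_2) + \mathrm{tr}(G_3^T L_3 G_3)$, with unit weights on every term and unnormalized Laplacians. The function names, matrix sizes and random data are hypothetical and chosen only for illustration.

```python
import numpy as np

def graph_laplacian(adjacency):
    """Unnormalized Laplacian L = D - A of a symmetric adjacency matrix."""
    return np.diag(adjacency.sum(axis=1)) - adjacency

def nmtf_objective(R12, R23, L2, L3, G1, G2, G3, H12, H23):
    """Assumed form of J_1: two reconstruction errors plus two graph penalties."""
    # Reconstruction error of the virus-host interactions, R12 ~ G1 H12 G2^T
    recon_vhi = np.linalg.norm(R12 - G1 @ H12 @ G2.T, "fro") ** 2
    # Reconstruction error of the drug-target interactions, R23 ~ G2 H23 G3^T
    recon_dti = np.linalg.norm(R23 - G2 @ H23 @ G3.T, "fro") ** 2
    # Penalties that pull embeddings of linked nodes together
    reg_min = np.trace(G2.T @ L2 @ G2)  # genes linked in the MIN
    reg_dcs = np.trace(G3.T @ L3 @ G3)  # drugs linked in the DCS
    return recon_vhi + recon_dti + reg_min + reg_dcs

rng = np.random.default_rng(0)

def random_symmetric_adjacency(n):
    """Random undirected, unweighted graph without self-loops (toy data)."""
    upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1)
    return (upper + upper.T).astype(float)

# Toy sizes: 5 viral proteins, 8 human genes, 6 drugs; k1, k2, k3 as in the caption
n1, n2, n3 = 5, 8, 6
k1, k2, k3 = 2, 3, 2

R12 = rng.integers(0, 2, size=(n1, n2)).astype(float)  # VHIs
R23 = rng.integers(0, 2, size=(n2, n3)).astype(float)  # DTIs
A2 = random_symmetric_adjacency(n2)                    # MIN adjacency
A3 = random_symmetric_adjacency(n3)                    # DCS adjacency
L2, L3 = graph_laplacian(A2), graph_laplacian(A3)

# Non-negative random factors; G2 appears in both decompositions (shared factor)
G1, G2, G3 = rng.random((n1, k1)), rng.random((n2, k2)), rng.random((n3, k3))
H12, H23 = rng.random((k1, k2)), rng.random((k2, k3))

value = nmtf_objective(R12, R23, L2, L3, G1, G2, G3, H12, H23)
print(f"J_1 at a random initialization: {value:.3f}")
```

Because $G_2$ enters both reconstruction terms, any update that lowers this value has to fit the VHIs and the DTIs at the same time, which is the sense in which the shared factor allows learning from all input matrices. Under the unnormalized-Laplacian assumption, the trace penalty equals half the sum of squared distances between the embedding rows of connected nodes, so it is small exactly when interacting genes (or chemically similar drugs) receive similar embedding vectors.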