Subset-Contrastive Multi-Omics Network Embedding

Pedro Henrique da Costa Avelar, Min Wu, Sophia Tsoka

TL;DR

SCONE tackles the memory and scalability challenges of graph-based multi-omics analyses by introducing subset-contrastive learning on two overlapping subset views. Each omic view is learned with GAT-based encoders on KNN graphs, and a shared latent space is formed through pooling and reconstruction, while subset-contrastive losses align overlapping samples and decay misalignment across non-overlapping ones. The approach yields competitive or superior clustering and survival-significance results across single-cell and bulk multi-omics datasets, while reducing memory demands relative to full-graph methods. Overall, SCONE demonstrates scalable, synergistic integration of heterogeneous omics layers with potential applicability to spatial transcriptomics and other large-scale multi-omics tasks.

Abstract

Motivation: Network-based analyses of omics data are widely used, and while many of these methods have been adapted to single-cell scenarios, they often remain memory- and space-intensive. As a result, they are better suited to batch data or smaller datasets. Furthermore, the application of network-based methods in multi-omics often relies on similarity-based networks, which lack structurally discrete topologies. This limitation may reduce the effectiveness of graph-based methods that were initially designed for topologies with better-defined structures.

Results: We propose Subset-Contrastive multi-Omics Network Embedding (SCONE), a method that employs contrastive learning techniques on large datasets through a scalable subgraph contrastive approach. By exploiting the pairwise similarity basis of many network-based omics methods, we turn this characteristic into a strength, developing an approach that aims to achieve scalable and effective analysis. Our method demonstrates synergistic omics integration for cell type clustering in single-cell data. Additionally, we evaluate its performance in a bulk multi-omics integration scenario, where SCONE performs comparably to the state of the art despite utilising limited views of the original data. We anticipate that our findings will motivate further research into the use of subset contrastive methods for omics data.

Paper Structure

This paper contains 15 sections, 1 theorem, 14 equations, 9 figures, 4 tables.

Key Result

Theorem 1

For a sampling rate $k_{s} \leq \frac{n}{2}$ and linear or superlinear operations, the memory required to store the values is at most a constant factor greater than that required for the full graph with $n$ nodes.
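As a rough illustration of why a bound of this kind is plausible, the inequality below works through one special case. It is not the paper's proof: it assumes a polynomial memory cost $f(m) = c\,m^{p}$ with $p \geq 1$ (the symbols $f$, $c$, and $p$ are introduced here for illustration only) and assumes that exactly two subsets of size $k_{s}$ are held in memory at once.

```latex
% Assumed cost model: f(m) = c m^p with p >= 1 (linear or superlinear).
% Two subsets of size k_s <= n/2 are stored instead of one graph on n nodes.
\[
  2 f(k_{s}) \;=\; 2c\,k_{s}^{p}
  \;\leq\; 2c\left(\frac{n}{2}\right)^{p}
  \;=\; 2^{\,1-p}\, c\,n^{p}
  \;\leq\; c\,n^{p} \;=\; f(n).
\]
```

Under these assumptions the two subsets together never need more memory than the full graph, and for a quadratic cost such as a dense pairwise-similarity matrix ($p = 2$) they need at most half of it.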

Figures (9)

  • Figure 1: A schematic representation of the proposed SCONE model. The model begins by subsetting the original dataset, sampling $k_{s}$ samples for each subset in each omics layer, resulting in subset 1 and subset 2. For each subset, a K-nearest neighbors graph is constructed for each omics layer. The omics measurements and graph of each subset are then input into an omics-specific GAT encoder $E_{i}$ to produce the latent representation $z_{i} = E_{i}(x_{i}, G_{i})$. Subset-specific representations are denoted as $z^{\prime}_{i}$ for subset 1 and $z^{\prime\prime}_{i}$ for subset 2. The outputs from the encoder layers are combined into a shared representation $z$, which is subsequently passed through the omics-specific GAT decoder $D_{i}$ to reconstruct each omics layer for each subset. We optimise the model through the reconstruction loss $\mathcal{L}_{\text{rec}_{i}}$. Simultaneously, the overlapping nodes between subsets are leveraged for the contrastive loss $\mathcal{L}_{C_{i}}$, which utilizes the encoder outputs and neighborhood information for both overlapping and non-overlapping nodes. Components in orange show the trajectory under subset 1, and components in green show the trajectory under subset 2. The components between the encoders and decoders in yellow show the overlap between both sets, which is used for our subset contrastive loss. The heatmaps shown correspond to the kotliarov_broad_2020 dataset, with omics layer 1 being the RNA data and omics layer o being CITE-seq data. For readability we reduce the number of samples, RNA probes, and CITE-seq probes by a factor of roughly 101, 10, and 2, respectively, with subsets further reducing the number of samples by approximately 10.
  • Figure 2: A t-SNE maaten_visualizing_2008 visualisation of the latent representation of the kotliarov_broad_2020 dataset learned by one of our multi-omics models. On the top left, we overlay the two batches defined by lotfollahi_multigrate_2021, which correspond to low and high vaccine responders. On the top right, we overlay the broad cell-type labels defined by kotliarov_broad_2020 through analysis of the surface proteins in their dataset, which we used as our ground truth for comparisons. At the bottom, we show the fine cell-type annotation provided by kotliarov_broad_2020 to showcase how our model separates these subtypes within the broader cell types.
  • Figure 3: A t-SNE maaten_visualizing_2008 visualisation of the latent representation of the kotliarov_broad_2020 dataset learned by one of our multi-omics models. We overlay the average of the $\log_{10}$ surface-protein expression for the marker proteins of each cell type defined by kotliarov_broad_2020. From left to right, top to bottom, we have the average surface-protein expression for T-cells, Monocytes, Natural Killer (NK) cells, plasmacytoid Dendritic Cells (pDC), Hematopoietic stem cells (HSC), and B-cells. A more detailed per-marker-protein view can be seen in Supplementary Figures \ref{sup:fig:marker-t}, \ref{sup:fig:marker-monocyte}, \ref{sup:fig:marker-nk}, \ref{sup:fig:marker-pdc}, \ref{sup:fig:marker-hsc}, and \ref{sup:fig:marker-b}, which further show how our integrated topology still retains the crucial surface-protein information.
  • Figure S1: A t-SNE maaten_visualizing_2008 visualisation of the latent representation of the kotliarov_broad_2020 dataset learned by one of our multi-omics models. We overlay the $\log_{10}$ surface-protein expression for the marker proteins of T-cells defined by kotliarov_broad_2020.
  • Figure S2: A t-SNE maaten_visualizing_2008 visualisation of the latent representation of the kotliarov_broad_2020 dataset learned by one of our multi-omics models. We overlay the $\log_{10}$ surface-protein expression for the marker proteins of Monocytes defined by kotliarov_broad_2020.
  • ...and 4 more figures
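
The subsetting and graph-construction steps described in the Figure 1 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `sample_overlapping_subsets` and `knn_edges`, the explicit `n_overlap` parameter, the Euclidean distance, and the toy data are all assumptions made here for the example. The GAT encoders, decoders, and losses are not shown.

```python
import numpy as np


def sample_overlapping_subsets(n, k_s, n_overlap, rng):
    """Draw two index subsets of size k_s that share exactly n_overlap samples.

    Fixing the overlap size explicitly is an assumption of this sketch.
    """
    if not 0 <= n_overlap <= k_s or 2 * k_s - n_overlap > n:
        raise ValueError("need 0 <= n_overlap <= k_s and 2*k_s - n_overlap <= n")
    perm = rng.permutation(n)
    shared = perm[:n_overlap]                 # samples present in both subsets
    only_1 = perm[n_overlap:k_s]              # samples unique to subset 1
    only_2 = perm[k_s:2 * k_s - n_overlap]    # samples unique to subset 2
    return np.concatenate([shared, only_1]), np.concatenate([shared, only_2])


def knn_edges(x, k):
    """Return a (2, m*k) array of directed edges from each row of x to its
    k nearest other rows (Euclidean distance, assumed here)."""
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                    # exclude self-loops
    nbrs = np.argsort(d2, axis=1)[:, :k]
    src = np.repeat(np.arange(x.shape[0]), k)
    return np.stack([src, nbrs.ravel()])


# Toy stand-in for one omics layer: 200 samples, 16 features.
rng = np.random.default_rng(0)
x_layer = rng.normal(size=(200, 16))

idx_1, idx_2 = sample_overlapping_subsets(200, k_s=50, n_overlap=10, rng=rng)
edges_1 = knn_edges(x_layer[idx_1], k=5)  # KNN graph for subset 1
edges_2 = knn_edges(x_layer[idx_2], k=5)  # KNN graph for subset 2
```

Each KNN graph is built only on a subset, so the dense distance matrix has $k_{s}^{2}$ entries rather than $n^{2}$, which is where the memory saving in this sketch comes from. The shared indices (the first `n_overlap` entries of `idx_1` and `idx_2`) are the samples a subset-contrastive loss would align across the two views.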

Theorems & Definitions (2)

  • Theorem 1
  • proof