Supervised Graph Contrastive Learning for Gene Regulatory Networks

Sho Oshima, Yuji Okamoto, Taisei Tosaki, Ryosuke Kojima

TL;DR

SupGCL is a probabilistic formulation that continuously generalizes conventional GCL, linking artificial augmentations with real perturbations measured in knockdown experiments, and using the latter as explicit supervision for GRNs.

Abstract

Graph Contrastive Learning (GCL) is a powerful self-supervised learning framework that performs data augmentation through graph perturbations, with growing applications in the analysis of biological networks such as Gene Regulatory Networks (GRNs). The artificial perturbations commonly used in GCL, such as node dropping, induce structural changes that can diverge from biological reality. This concern has contributed to a broader trend in graph representation learning toward augmentation-free methods, which treat such structural changes as a problem to be avoided. However, this trend overlooks a fundamental insight: structural changes caused by biologically meaningful perturbations are not a problem to be avoided but a rich source of information, so avoiding them misses the opportunity to leverage data from real biological experiments. Motivated by this insight, we propose SupGCL (Supervised Graph Contrastive Learning), a new GCL method for GRNs that directly incorporates biological perturbations from gene knockdown experiments as supervision. SupGCL is a probabilistic formulation that continuously generalizes conventional GCL, linking artificial augmentations with real perturbations measured in knockdown experiments, and using the latter as explicit supervision. On patient-derived GRNs from three cancer types, we train GRN representations with SupGCL and evaluate them in two regimes: (i) embedding space analysis, where SupGCL yields clearer disease-subtype structure and improves clustering, and (ii) task-specific fine-tuning, where it consistently outperforms strong graph representation learning baselines on 13 downstream tasks spanning gene-level functional annotation and patient-level prediction.
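To make the phrase "continuously generalizes conventional GCL" more concrete, the following is a minimal sketch of one way a KL-divergence contrastive objective can contain the usual InfoNCE loss as a special case. It is not the authors' loss: the function names, the temperature value, the cosine-similarity softmax, and the random stand-in embeddings are all assumptions made for illustration. The only point it demonstrates is that a one-hot target distribution recovers InfoNCE, while a soft target (here derived from hypothetical "teacher" embeddings) gives a supervised variant of the same objective.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def similarity_distribution(z_a, z_b, temperature=0.5):
    # Row-wise softmax over cosine similarities between two embedding sets.
    # The temperature value is an illustrative assumption.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    return softmax(z_a @ z_b.T / temperature, axis=1)

def kl_contrastive_loss(target, model, eps=1e-12):
    # Mean over rows of KL(target || model).
    kl = np.sum(target * (np.log(target + eps) - np.log(model + eps)), axis=1)
    return kl.mean()

rng = np.random.default_rng(0)
n, d = 6, 8

# Stand-in embeddings of two augmented views (random data, not real GRNs).
z_view1 = rng.normal(size=(n, d))
z_view2 = z_view1 + 0.1 * rng.normal(size=(n, d))

# Hypothetical teacher embeddings playing the role of real-perturbation data.
t_view1 = rng.normal(size=(n, d))
t_view2 = t_view1 + 0.1 * rng.normal(size=(n, d))

model = similarity_distribution(z_view1, z_view2)

# Self-supervised case: the target is one-hot on the matching pair.
loss_self = kl_contrastive_loss(np.eye(n), model)
infonce = -np.mean(np.log(np.diag(model)))

# Supervised case: the target is a soft distribution from the teacher.
teacher = similarity_distribution(t_view1, t_view2)
loss_sup = kl_contrastive_loss(teacher, model)

print(f"KL with one-hot target: {loss_self:.6f}")
print(f"InfoNCE:                {infonce:.6f}")
print(f"KL with teacher target: {loss_sup:.6f}")
```

In this sketch the first two printed values agree up to numerical tolerance, because the KL divergence against a one-hot target reduces to the negative log-probability of the positive pair. Interpolating the target between `np.eye(n)` and `teacher` would move smoothly between the two regimes; whether SupGCL parameterizes the transition this way is not stated in the abstract.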

Paper Structure

This paper contains 84 sections, 6 theorems, 58 equations, 8 figures, 31 tables, 1 algorithm.

Key Result

Theorem 4.1

Assuming $p_\phi(i,j,a,b) = p(i,j)p_\phi(a,b)$, then

Figures (8)

  • Figure 1: Schematic overview of SupGCL. Artificial augmentations are generated by simulating gene knockdowns in a patient GRN, while the teacher GRNs for supervision are derived from real-world knockdown experiments. Embeddings are extracted using a shared GNN, and both node-level and augmentation-level contrastive losses are computed via KL divergence.
  • Figure 2: Overview of downstream tasks. Node-level tasks involve gene classification into Biological Process [BP.], Cellular Component [CC.], and cancer relevance [Rel.]. Graph-level tasks include patient survival prediction [Hazard] and breast cancer subtyping [Subtype]. Mean pooling provides graph-level representations.
  • Figure 3: t-SNE visualization of pre-trained graph-level embeddings on breast cancer GRNs. Each point represents the readout feature of an individual patient’s network. NMI and ARI scores indicate quantitative clustering metrics of the embeddings across 5 experimental runs.
  • Figure 4: t-SNE visualization of pre-trained embeddings on breast, lung, and colorectal cancer GRNs.
  • Figure 5: PCA analysis of the latent spaces of pre-trained models.
  • ...and 3 more figures

Theorems & Definitions (11)

  • Theorem 4.1
  • proof
  • Corollary 4.2
  • proof
  • Remark 4.3
  • Proposition 6.1: Sampling Error of SupGCL
  • Proposition 6.2: Error due to Node Sampling
  • Lemma 6.3
  • proof
  • Lemma 6.4
  • ...and 1 more