Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations

Pengcheng Jiang, Cao Xiao, Tianfan Fu, Parminder Bhatia, Taha Kass-Hout, Jimeng Sun, Jiawei Han

TL;DR

Gode tackles the limitation of single-graph molecular representations by jointly pre-training a molecule Graph Neural Network (M-GNN) and a molecule-centric knowledge-graph GNN (K-GNN) on bi-level graphs and aligning them with contrastive learning. It introduces MolKG, a specialized biochemical knowledge graph, and trains M-GNN on molecular graphs while K-GNN learns from $\kappa$-hop KG sub-graphs, followed by InfoNCE-based alignment and downstream fine-tuning with a joint representation. Across 11 chemical-property tasks, Gode delivers substantial improvements, achieving state-of-the-art results in classification and strong gains in regression relative to baselines such as GROVER, MolCLR, and KANO. The approach demonstrates the value of integrating chemical structure and biological knowledge for more accurate, robust molecular property predictions, with potential impact on accelerated drug discovery and knowledge-driven molecular analyses.
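
The InfoNCE-based alignment mentioned above can be illustrated with a minimal sketch. It assumes cosine similarity, in-batch negatives, and a temperature of 0.1; none of these choices is stated in this summary, and the function names `cosine` and `info_nce` are hypothetical. Row i of `mol_emb` stands in for an M-GNN embedding and row i of `kg_emb` for the K-GNN embedding of the same molecule.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(mol_emb, kg_emb, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    (mol_emb[i], kg_emb[i]) is treated as the positive pair; every other
    kg_emb[j] in the batch is treated as a negative (an assumption).
    """
    n = len(mol_emb)
    total = 0.0
    for i in range(n):
        logits = [cosine(mol_emb[i], kg_emb[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp for the denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings: the two views agree in the first case and are swapped in the second.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each molecule embedding is closest to its own sub-graph embedding and large when the pairing is swapped, which is the agreement that the contrastive stage is described as maximizing.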

Abstract

Molecular representation learning is vital for various downstream applications, including the analysis and prediction of molecular properties and side effects. While Graph Neural Networks (GNNs) have been a popular framework for modeling molecular data, they often struggle to capture the full complexity of molecular representations. In this paper, we introduce a novel method called GODE, which accounts for the dual-level structure inherent in molecules. Molecules possess an intrinsic graph structure and simultaneously function as nodes within a broader molecular knowledge graph. GODE integrates individual molecular graph representations with multi-domain biochemical data from knowledge graphs. By pre-training two GNNs on different graph structures and employing contrastive learning, GODE effectively fuses molecular structures with their corresponding knowledge graph substructures. This fusion yields a more robust and informative representation, enhancing molecular property predictions by leveraging both chemical and biological information. When fine-tuned across 11 chemical property tasks, our model significantly outperforms existing benchmarks, achieving an average ROC-AUC improvement of 12.7% for classification tasks and an average RMSE/MAE improvement of 34.4% for regression tasks. Notably, GODE surpasses the current leading model in property prediction, with advancements of 2.2% in classification and 7.2% in regression tasks.

Paper Structure (20 sections, 4 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Overview of our framework Gode. Left: The $\kappa$-hop KG sub-graph consisting of molecule-relevant relational knowledge, originating from a central molecule. Right: We conduct (i) Molecule-level Pre-training on the molecular graphs with contextual property prediction and motif prediction tasks; (ii) KG-level Pre-training on the $\kappa$-hop KG sub-graphs of a central molecule with the tasks of edge prediction, node prediction, and motif prediction; (iii) Contrastive Learning to maximize the agreement between M-GNN and K-GNN, pre-trained by (i) and (ii), respectively; and (iv) Fine-tuning of our learned embedding, optionally enriched with extracted molecular-level features, for specific property predictions.
  • Figure 2: Ablation study configurations and results. (Left) Configurations. "KGE": KG embedding initialization. "$\kappa$": $\kappa$-hop KG subgraph. "Pret.": KG-level pre-training. "Cont.": contrastive learning. "Embedding": input to MLP for fine-tuning. (Right) Performance comparison across different datasets and configurations. We highlight the best configuration for each dataset in red. The dotted blue lines denote the performance achieved by the backbone model (GROVER).
  • Figure 3: t-SNE visualization of molecule embeddings across two tasks. Each color represents a unique scaffold (molecule substructure). We compare the embeddings from GROVER, KANO, and Gode. The clustering quality is assessed using the Davies-Bouldin (DB) index.
  • Figure 4: Performance of knowledge graph-level pre-training tasks. We report the mean and standard deviation based on five runs with different random seeds.
  • Figure 5: An overview of the differences between Gode and similar works (KGE_NFM by ye2021unified and KANO by fang2023knowledge) that leverage both knowledge graphs and molecules. Details such as pre-training strategies and KG embedding initialization are omitted for clarity.
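
The DB index used to assess clustering in Figure 3 can be sketched as follows, assuming it denotes the Davies-Bouldin index, that scaffold IDs serve as cluster labels, and that distances are Euclidean. The caption does not say whether the index is computed on the t-SNE coordinates or on the original embeddings, and `davies_bouldin` is a hypothetical helper name.

```python
import math
from collections import defaultdict


def davies_bouldin(points, labels):
    """Davies-Bouldin index; lower values mean tighter, better-separated clusters.

    Requires at least two clusters with distinct centroids.
    """
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    # Centroid of each cluster and mean distance of its members to that centroid.
    centroids = {
        key: [sum(coord) / len(members) for coord in zip(*members)]
        for key, members in clusters.items()
    }
    scatter = {
        key: sum(math.dist(p, centroids[key]) for p in members) / len(members)
        for key, members in clusters.items()
    }

    # For each cluster, take its worst (largest) similarity ratio to any other cluster.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy 2-D points with two scaffold labels (hypothetical data).
tight = davies_bouldin([(0, 0), (0, 1), (10, 0), (10, 1)], [0, 0, 1, 1])
loose = davies_bouldin([(0, 0), (0, 4), (5, 0), (5, 4)], [0, 0, 1, 1])
print(tight < loose)
```

Under these assumptions, embeddings that group molecules with the same scaffold more compactly receive a lower score.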

Theorems & Definitions (4)

  • Definition 1: Molecule Graph
  • Definition 2: Knowledge Graph
  • Definition 3: M-GNN
  • Definition 4: K-GNN
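
To make the knowledge-graph definition and the $\kappa$-hop sub-graph of a central molecule more concrete, here is a minimal sketch. It assumes the KG is stored as (head, relation, tail) triples, that edges are traversed in both directions, and that the sub-graph keeps every triple whose endpoints both lie within $\kappa$ hops of the center. The function name `k_hop_subgraph` and the toy entities are hypothetical and are not taken from MolKG.

```python
from collections import deque


def k_hop_subgraph(triples, center, k):
    """Return the triples induced by all nodes within k hops of `center`."""
    # Undirected adjacency, used only for the breadth-first traversal.
    neighbors = {}
    for head, _, tail in triples:
        neighbors.setdefault(head, set()).add(tail)
        neighbors.setdefault(tail, set()).add(head)

    # Breadth-first search that stops expanding at depth k.
    depth = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue
        for nxt in neighbors.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)

    return [(h, r, t) for h, r, t in triples if h in depth and t in depth]


# Toy knowledge graph with placeholder entities and relations.
toy_kg = [
    ("mol_A", "targets", "gene_X"),
    ("gene_X", "participates_in", "pathway_P"),
    ("pathway_P", "associated_with", "disease_D"),
    ("mol_B", "targets", "gene_Y"),
]
one_hop = k_hop_subgraph(toy_kg, "mol_A", 1)
two_hop = k_hop_subgraph(toy_kg, "mol_A", 2)
print(one_hop)
print(two_hop)
```

Increasing `k` pulls in more distant relational context around the central molecule, while triples in disconnected parts of the graph (here those involving `mol_B`) are never included.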