Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations
Pengcheng Jiang, Cao Xiao, Tianfan Fu, Parminder Bhatia, Taha Kass-Hout, Jimeng Sun, Jiawei Han
TL;DR
GODE addresses the limitation of single-graph molecular representations by pre-training a molecule-level Graph Neural Network (M-GNN) and a molecule-centric knowledge-graph GNN (K-GNN) on bi-level graphs and aligning them with contrastive learning. It introduces MolKG, a specialized biochemical knowledge graph. M-GNN is trained on molecular graphs while K-GNN learns from $\kappa$-hop KG sub-graphs centered on each molecule, followed by InfoNCE-based alignment and downstream fine-tuning with a joint representation. Across 11 chemical-property tasks, GODE achieves state-of-the-art results in classification and strong gains in regression relative to baselines such as GROVER, MolCLR, and KANO. The approach demonstrates the value of integrating chemical structure and biological knowledge for more accurate, robust molecular property prediction, with potential impact on accelerated drug discovery and knowledge-driven molecular analysis.
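To make the notion of a $\kappa$-hop KG sub-graph concrete, the following is a minimal sketch of extracting the neighborhood of a central molecule from a list of (head, relation, tail) triples. It is an illustration under stated assumptions, not the authors' implementation: the function name `k_hop_subgraph`, the entity and relation names in the toy graph, and the choices to treat edges as undirected and to keep the induced sub-graph are all hypothetical.

```python
from collections import deque


def k_hop_subgraph(triples, center, k):
    """Return (hop distances, kept triples) for the k-hop neighborhood of `center`.

    Assumption: edges are traversed in both directions, and a triple is kept
    when both of its endpoints lie within k hops of the center.
    """
    # Build an undirected adjacency map from the triples.
    neighbors = {}
    for head, _relation, tail in triples:
        neighbors.setdefault(head, set()).add(tail)
        neighbors.setdefault(tail, set()).add(head)

    # Breadth-first search that stops expanding at depth k.
    dist = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue
        for nxt in neighbors.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)

    # Keep the sub-graph induced by the reached entities.
    kept = [t for t in triples if t[0] in dist and t[2] in dist]
    return dist, kept


# Toy knowledge graph with hypothetical entities and relations.
toy_triples = [
    ("mol_A", "binds", "gene_X"),
    ("gene_X", "participates_in", "pathway_P"),
    ("pathway_P", "linked_to", "disease_D"),
    ("mol_B", "binds", "gene_X"),
]

one_hop_dist, one_hop_triples = k_hop_subgraph(toy_triples, "mol_A", 1)
two_hop_dist, two_hop_triples = k_hop_subgraph(toy_triples, "mol_A", 2)
```

In this toy graph, one hop from `mol_A` reaches only `gene_X`, while two hops also reach `pathway_P` and `mol_B`; `disease_D` is three hops away and stays outside both sub-graphs.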
Abstract
Molecular representation learning is vital for various downstream applications, including the analysis and prediction of molecular properties and side effects. While Graph Neural Networks (GNNs) have been a popular framework for modeling molecular data, they often struggle to capture the full complexity of molecular representations. In this paper, we introduce a novel method called GODE, which accounts for the dual-level structure inherent in molecules. Molecules possess an intrinsic graph structure and simultaneously function as nodes within a broader molecular knowledge graph. GODE integrates individual molecular graph representations with multi-domain biochemical data from knowledge graphs. By pre-training two GNNs on different graph structures and employing contrastive learning, GODE effectively fuses molecular structures with their corresponding knowledge graph substructures. This fusion yields a more robust and informative representation, enhancing molecular property predictions by leveraging both chemical and biological information. When fine-tuned across 11 chemical property tasks, our model significantly outperforms existing baselines, achieving an average ROC-AUC improvement of 12.7% for classification tasks and an average RMSE/MAE improvement of 34.4% for regression tasks. Notably, GODE surpasses the current leading model in property prediction, with advancements of 2.2% in classification and 7.2% in regression tasks.
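The contrastive alignment between the two GNNs can be illustrated with a generic InfoNCE loss, the objective named in the TL;DR. The sketch below is a plain-Python illustration under assumptions, not the paper's exact formulation: the function name `info_nce`, the cosine-similarity scoring, the temperature value, the use of other items in the batch as negatives, and the one-directional (molecule to KG) form are all hypothetical choices. Row i of `mol_emb` stands for an M-GNN embedding of molecule i, and row i of `kg_emb` for the K-GNN embedding of the same molecule's KG sub-graph.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(mol_emb, kg_emb, temperature=0.1):
    """Mean InfoNCE loss where (mol_emb[i], kg_emb[i]) is the positive pair.

    Assumption: every other kg_emb[j] in the batch acts as a negative for
    molecule i.
    """
    mol = [_normalize(v) for v in mol_emb]
    kg = [_normalize(v) for v in kg_emb]
    total = 0.0
    for i, m_i in enumerate(mol):
        # Temperature-scaled cosine similarity to every KG embedding.
        logits = [sum(a * b for a, b in zip(m_i, k_j)) / temperature for k_j in kg]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability assigned to the matching pair.
        total += log_denominator - logits[i]
    return total / len(mol)


# Hypothetical 2-D embeddings: aligned pairs versus deliberately swapped pairs.
mol_embeddings = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = info_nce(mol_embeddings, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = info_nce(mol_embeddings, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each molecule embedding matches its own sub-graph embedding and grows when the pairs are mismatched, which is the pressure that pulls the two representation spaces together during alignment.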
