UniCorn: A Unified Contrastive Learning Approach for Multi-view Molecular Representation Learning

Shikun Feng, Yuyan Ni, Minghao Li, Yanwen Huang, Zhi-Ming Ma, Wei-Ying Ma, Yanyan Lan

TL;DR

UniCorn is motivated by the lack of a universal molecular pre-training model. It is a unified framework that combines fragment masking on 2D graphs, torsion-augmented denoising on 3D conformations, and cross-modal distillation to learn multi-view molecular representations. The approach formalizes the relationship between reconstructive ($\mathcal{L}_{RC}$) and contrastive ($\mathcal{L}_{CL}$) losses via a regularization term ($\mathcal{L}_{reg}$), establishing mutual bounds with constants $\lambda_{\max}$ and $\lambda_{\min}$, and interprets learned representations as multi-level clusters. UniCorn’s architecture (Fragment Masking Module, Torsion Augmented Denoising Module, and Cross-modal Distillation Module) yields hierarchical embeddings that improve quantum, physicochemical, and biological predictions, achieving state-of-the-art results on QM9, MD17, MD22, and MoleculeNet (33/38 tasks). Ablation and visualization studies support the complementary nature and universality of the learned representations. The work offers a principled, theory-driven path toward universal molecular foundation models that serve a broad range of molecular tasks.

Abstract

Recently, a noticeable trend has emerged in developing pre-trained foundation models in the domains of CV and NLP. However, for molecular pre-training, there is no universal model that applies effectively to the various categories of molecular tasks, since the prevalent pre-training methods are each effective only for specific types of downstream tasks. Furthermore, the lack of a profound understanding of existing pre-training methods, including 2D graph masking, 2D-3D contrastive learning, and 3D denoising, hampers the advancement of molecular foundation models. In this work, we provide a unified comprehension of existing pre-training methods through the lens of contrastive learning. From this perspective, their distinctions lie in clustering different views of molecules, which is shown to be beneficial to specific downstream tasks. To achieve a complete and general-purpose molecular representation, we propose a novel pre-training framework, named UniCorn, that inherits the merits of the three methods, depicting molecular views at three different levels. SOTA performance across quantum, physicochemical, and biological tasks, along with a comprehensive ablation study, validates the universality and effectiveness of UniCorn.
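
Of the three pre-training families named in the abstract, 3D denoising is the easiest to picture in isolation. The sketch below follows the two-step description given in the Figure 3 caption (perturb a rotatable torsion, then add Gaussian coordinate noise). It is a minimal illustration under stated assumptions, not the authors' implementation: the four-atom chain, the noise scales `0.5` and `0.04`, the choice of the noise itself as the regression target, and every function name are hypothetical.

```python
import math
import random


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def scale(a, s):
    return [x * s for x in a]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def rotate_about_axis(point, origin, axis, angle):
    """Rodrigues rotation of `point` about the line through `origin` along `axis`."""
    k = scale(axis, 1.0 / math.sqrt(dot(axis, axis)))
    v = sub(point, origin)
    rotated = add(add(scale(v, math.cos(angle)),
                      scale(cross(k, v), math.sin(angle))),
                  scale(k, dot(k, v) * (1.0 - math.cos(angle))))
    return add(rotated, origin)


def perturb_torsion(coords, bond, moving_atoms, angle):
    """Step 1: rotate the atoms on one side of a rotatable bond by `angle` radians."""
    i, j = bond
    axis = sub(coords[j], coords[i])
    new_coords = [list(c) for c in coords]
    for atom in moving_atoms:
        new_coords[atom] = rotate_about_axis(coords[atom], coords[j], axis, angle)
    return new_coords


def add_coordinate_noise(coords, sigma, rng):
    """Step 2: add isotropic Gaussian noise; the noise is kept as the target."""
    noise = [[rng.gauss(0.0, sigma) for _ in range(3)] for _ in coords]
    noisy = [add(c, n) for c, n in zip(coords, noise)]
    return noisy, noise


def denoising_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and true coordinate noise."""
    squared = [(p - t) ** 2
               for pn, tn in zip(predicted_noise, true_noise)
               for p, t in zip(pn, tn)]
    return sum(squared) / len(squared)


rng = random.Random(0)
# Hypothetical four-atom chain; the bond between atoms 1 and 2 is rotatable.
chain = [[-0.5, 0.9, 0.0], [0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.0, 0.9, 0.0]]
augmented = perturb_torsion(chain, bond=(1, 2), moving_atoms=[3],
                            angle=rng.gauss(0.0, 0.5))
noisy, noise = add_coordinate_noise(augmented, sigma=0.04, rng=rng)

# A model that predicts "no noise" pays a positive loss; a perfect one pays zero.
zero_prediction = [[0.0, 0.0, 0.0] for _ in noisy]
print(denoising_loss(zero_prediction, noise))
print(denoising_loss(noise, noise))
```

The torsion step changes the conformer while keeping bond lengths fixed, so the model sees a wider range of low-energy-like geometries than coordinate noise alone would produce. In a real pipeline the rotatable bonds and the atoms they move would come from a cheminformatics toolkit rather than being listed by hand.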

Paper Structure

This paper contains 44 sections, 5 theorems, 22 equations, 5 figures, 14 tables, 2 algorithms.

Key Result

Theorem 2.1

We introduce an additional loss aimed at regularizing the decoder to approximate the inverse of the encoder. When $\lambda_{\text{max}}$ and $\lambda_{\text{min}}$ are non-zero, we derive two conclusions. The first indicates that updating the encoder and aligner via contrastive learning, and updating the decoder via the regularization loss, guarantees a small reconstructive loss. The second indicates that ...
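
The theorem's inequalities are not reproduced here, but the kind of bound it describes can be illustrated with a much simpler analogue. The sketch below is not the paper's setting: it assumes a linear encoder and decoder, no aligner, Euclidean norms, and the decoder's Frobenius norm in place of $\lambda_{\text{max}}$; all names are hypothetical. Under those assumptions the triangle inequality gives: reconstruction error on a corrupted input ≤ (decoder norm) × (gap between corrupted and clean representations) + (error of the decoder as an inverse of the encoder).

```python
import math
import random


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def frobenius(matrix):
    # Upper bound on the operator norm: ||M v|| <= ||M||_F * ||v||.
    return math.sqrt(sum(a * a for row in matrix for a in row))


def loss_terms(encoder, decoder, x, x_corrupted):
    """Return (reconstruction, alignment, regularization) for one sample."""
    z_clean = matvec(encoder, x)
    z_corrupted = matvec(encoder, x_corrupted)
    # Reconstructive term: decode the corrupted view, compare with the clean input.
    reconstruction = norm(sub(matvec(decoder, z_corrupted), x))
    # Contrastive-style alignment term: pull the two views together in latent space.
    alignment = norm(sub(z_corrupted, z_clean))
    # Regularization term: how far the decoder is from inverting the encoder.
    regularization = norm(sub(matvec(decoder, z_clean), x))
    return reconstruction, alignment, regularization


rng = random.Random(0)
encoder = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
decoder = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
x = [rng.uniform(-1, 1) for _ in range(3)]
x_corrupted = [xi + rng.gauss(0.0, 0.1) for xi in x]

rc, cl, reg = loss_terms(encoder, decoder, x, x_corrupted)
bound = frobenius(decoder) * cl + reg
print(rc <= bound + 1e-12)
```

The inequality holds for any matrices and inputs, which is why driving the alignment and regularization terms down also keeps the reconstruction term small in this toy setting.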

Figures (5)

  • Figure 1: Correspondence between self-supervised learning (SSL) methods, views of molecules, and molecular properties. Different SSL methods cluster the molecular representations based on different levels of similarity (see the section on the theoretical unification of the methods). These clustering patterns align with the characteristics of properties at different scales (see the section on the relation between methods and tasks).
  • Figure 2: A unified perspective of reconstructive and contrastive methods.
  • Figure 3: The UniCorn architecture, reached after exploring the association between each pre-training method and downstream tasks, with the goal of approaching unified molecular representations. The top panel illustrates the Fragment Masking Module, in which a 2D molecular graph is masked by fragments and subsequently recovered. The bottom panel shows the Torsion Augmented Denoising Module, which operates in two steps: it first augments 3D conformers by perturbing rotatable torsions, and then adds Gaussian coordinate noise for denoising. The middle panel shows the Cross-modal Distillation Module, which distills knowledge from 2D to 3D to achieve a hierarchical molecular representation.
  • Figure 4: The visualization showcases hierarchical molecular representations learned by UniCorn, illustrating its ability to achieve effective clustering across different levels.
  • Figure 5: Clustering results for molecular representations without fine-tuning, produced by three distinct methods across three diverse tasks: BBBP (biological task), FreeSolv (physicochemical task), and HOMO from QM9 (quantum task). The color indicates the labels of the downstream tasks: discrete binary labels for BBBP and continuous labels for FreeSolv and QM9. Below each subfigure, the Davies–Bouldin index evaluates the clustering quality (smaller is better). While the Masking and Denoising methods favor biological and quantum tasks respectively, UniCorn achieves good clustering results across all three types of tasks.
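
The Davies–Bouldin index reported under Figure 5 can be computed in a few lines. The sketch below uses the standard definition (for each cluster, the worst ratio of summed within-cluster scatter to centroid distance, averaged over clusters) and assumes discrete cluster labels. How the continuous FreeSolv and QM9 labels are grouped is not stated in this summary, so that step is left out; the function names and toy points are illustrative only.

```python
import math


def centroid(points):
    count = len(points)
    return [sum(p[d] for p in points) / count for d in range(len(points[0]))]


def davies_bouldin(points, labels):
    """Davies-Bouldin index for labelled points; smaller means tighter, better-separated clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    keys = sorted(clusters)
    centers = {k: centroid(clusters[k]) for k in keys}
    # Scatter: mean distance from each member to its cluster centroid.
    scatter = {
        k: sum(math.dist(p, centers[k]) for p in clusters[k]) / len(clusters[k])
        for k in keys
    }

    total = 0.0
    for i in keys:
        # Worst-case similarity of cluster i to any other cluster.
        # Assumes distinct centroids; coincident centroids would divide by zero.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in keys if j != i
        )
    return total / len(keys)


labels = [0, 0, 1, 1]
well_separated = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
close_together = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

print(davies_bouldin(well_separated, labels))  # ≈ 0.1
print(davies_bouldin(close_together, labels))  # 1.0
```

Moving the two clusters ten times farther apart, with the same internal spread, lowers the index by a factor of ten, which matches the "smaller is better" reading in the caption.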

Theorems & Definitions (13)

  • Theorem 2.1: Relations between reconstructive and contrastive loss
  • Theorem 2.2
  • Definition 2.1: Reconstruction loss
  • Definition 2.2: Contrastive loss
  • Definition 2.3: Regularization loss
  • Definition 2.4: Modified reconstruction loss
  • Lemma 2.5
  • proof
  • Lemma 2.6
  • proof
  • ...and 3 more