Multi-channel learning for integrating structural hierarchies into context-dependent molecular representation

Yue Wan, Jialu Wu, Tingjun Hou, Chang-Yu Hsieh, Xiaowei Jia

TL;DR

The paper tackles data scarcity and activity cliffs in molecular property prediction by introducing a prompt-guided, multi-channel self-supervised pre-training framework. It learns separate representations through molecule-distancing, scaffold-distancing, and context-prediction channels, and combines them during fine-tuning via a task-specific prompt mechanism. A novel adaptive-margin triplet loss and scaffold-invariant perturbations improve discrimination among closely related structures while preserving chemical knowledge. Across MoleculeNet and MoleculeACE benchmarks, the approach achieves state-of-the-art or competitive performance and shows enhanced robustness to activity cliffs, with representation-space analyses demonstrating better knowledge preservation during fine-tuning and interpretable channel contributions. This framework offers a principled path toward context-dependent, chemistry-aware molecular representations with practical implications for drug discovery and beyond.
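
To make the adaptive-margin idea concrete, here is a minimal sketch of a triplet loss whose margin depends on the structural similarity between the anchor and the negative molecule. The margin rule used here, `max_margin * (1 - similarity)` with a Tanimoto similarity over fingerprint bits, is an assumption chosen for illustration; the summary does not give the paper's exact formula. The function names and the `max_margin` parameter are likewise hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

def adaptive_margin_triplet_loss(anchor, positive, negative, similarity, max_margin=1.0):
    """Triplet loss with a per-pair margin.

    ASSUMPTION: the margin shrinks linearly as the negative becomes more
    similar to the anchor, so near neighbours are separated less aggressively
    than unrelated molecules. This rule is illustrative only.
    """
    margin = max_margin * (1.0 - similarity)
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy embeddings: the same geometry is scored against two different negatives.
z_anchor, z_positive, z_negative = [0.0, 0.0], [0.1, 0.0], [0.5, 0.0]

fp_anchor = {1, 2, 3, 4}
fp_near = {1, 2, 3, 5}   # shares 3 of 5 distinct bits with the anchor
fp_far = {7, 8, 9}       # shares no bits with the anchor

loss_near = adaptive_margin_triplet_loss(
    z_anchor, z_positive, z_negative, tanimoto(fp_anchor, fp_near))
loss_far = adaptive_margin_triplet_loss(
    z_anchor, z_positive, z_negative, tanimoto(fp_anchor, fp_far))
```

With identical embedding distances, the structurally unrelated negative incurs the larger loss, so it is pushed farther from the anchor than the close analogue. A fixed margin would treat both negatives the same.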

Abstract

Reliable molecular property prediction is essential for various scientific endeavors and industrial applications, such as drug discovery. However, data scarcity, combined with the highly non-linear causal relationships between physicochemical and biological properties and conventional molecular featurization schemes, complicates the development of robust molecular machine learning models. Self-supervised learning (SSL) has emerged as a popular solution, utilizing large-scale, unannotated molecular data to learn a foundational representation of chemical space that might be advantageous for downstream tasks. Yet, existing molecular SSL methods largely overlook chemical knowledge, including molecular structure similarity, scaffold composition, and the context-dependent aspects of molecular properties when operating over the chemical space. They also struggle to learn the subtle variations in structure-activity relationships. This paper introduces a novel pre-training framework that learns robust and generalizable chemical knowledge. It leverages the structural hierarchy within the molecule, embeds it through distinct pre-training tasks across channels, and aggregates channel information in a task-specific manner during fine-tuning. Our approach demonstrates competitive performance across various molecular property benchmarks and offers strong advantages in particularly challenging yet ubiquitous scenarios like activity cliffs.

Paper Structure (9 sections, 5 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Framework overview. a) The prompt-guided pretrain-finetune framework. For each downstream task, the model is additionally optimized on the prompt weight selection, locating the pre-trained channel most compatible with the current application. b) Molecule contrastive distancing (MCD), where the positive samples $G_i'$ from subgraph masking are contrasted against negative samples $G_j$ by an adaptive margin. c) Scaffold contrastive distancing (SCD), where the positive samples $G_i'$ from scaffold-invariant perturbation are contrasted against negative samples $G_j$ by an adaptive margin. d) The context prediction (CP) channel, which consists of masked subgraph prediction and motif prediction tasks. e) Prompt-guided aggregation module, which conditionally aggregates atom representations into a molecule representation via a prompt token. It is realized as multi-head attention with the prompt embedding $h_p$ as the query.
  • Figure 2: Binding potency prediction. a) The violin plot illustrates model performance across 30 binding potency prediction tasks in MoleculeACE. Performance on each dataset is averaged across three independent runs using stratified splits. The asterisk (*) indicates that the model is trained from scratch (no pre-training). The methods are ordered from left to right by average performance. The width of each violin represents the density of data points, with wider sections indicating a higher concentration of values. The minima and maxima are displayed by the lower and upper ends of the violin. The middle solid white line represents the median value. b) Average model performance across 30 binding potency prediction tasks with respect to model size (also indicated by the size of the dots). The dashed line follows the performance of the multi-layer perceptron (MLP), which serves as the baseline comparison between traditional fingerprints and deep molecular representations. c) Relationship between model performance and the presence of activity cliffs in each dataset, measured by the roughness of the molecular property landscape. Performance on each dataset is averaged across three independent runs using random splits. The size of each dataset is indicated by the size of its dot. The slope of the fitted line for each method indicates the correlation. Ours$_\text{GPS}$ is used in this analysis.
  • Figure 3: Representation space probing. The dynamics of the molecule representation space of three methods at five fine-tuning timestamps on dataset CHEMBL237_Ki ($n=2602$). For each row pair, a 2D view of the representation space is visualized for both the training set (top) and the validation set (bottom). The color map represents the normalized binding potency of each data point from lowest to highest within the dataset. Rand index and roughness index (ROGI) are reported for training, while R-squared value and cliff-noncliff distance ratio are reported for validation. The grey circles correspond to the clustering assignment using the ECFP4 fingerprint. The red arrow indicates the distance of a cliff molecule pair, while the blue arrow indicates that of a non-cliff molecule pair.
  • Figure 4: Representation shift decomposition. Detailed analysis of the representation shift during fine-tuning across three datasets. The dashed lines (top row) correspond to the representation shift (approximated by the Rand index) for each individual channel, while the solid lines represent the overall method. The bottom row displays the optimized prompt weights distributed over channels.
  • Figure 5: Activity cliffs analysis. Evaluation of the binding mode and the atom importance captured by our model on (a) a 5-phenyl-4-phenyldiazenyl-1,2-dihydropyrazol-3-one series and (b) an N-phenyl-4-pyrazolo[1,5-b]pyridazin-3-ylpyrimidin-2-amine series. The binding modes of compound $a1$ and compound $b1$ in the active sites of GSK3$\beta$, with Protein Data Bank (PDB) ID: 3L1S, are predicted by AutoDock Vina and visualized by PyMOL. The key hydrogen bonds between compounds and the active sites are highlighted by yellow dashed lines, while the green dashed line refers to the intra-molecular halogen bond. The color intensity on each atom indicates its respective contribution to the prediction of our method, computed with GNNExplainer.
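
The prompt-guided aggregation described in the Figure 1e caption (multi-head attention with the prompt embedding $h_p$ as the query) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the learned query, key, and value projections of a real attention layer are omitted (treated as identity), heads are formed by slicing the feature dimension, and the function name and `num_heads` default are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def prompt_guided_aggregation(atom_reprs, prompt, num_heads=2):
    """Pool atom vectors into one molecule vector, using the prompt as query.

    atom_reprs: list of n_atoms vectors, each of length d
    prompt:     prompt embedding h_p, a vector of length d
    ASSUMPTION: projections are identity; each head attends over its own
    slice of the feature dimension.
    """
    d = len(prompt)
    assert d % num_heads == 0, "feature size must split evenly across heads"
    d_head = d // num_heads
    molecule, weights = [], []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        query = prompt[lo:hi]
        # Scaled dot product between the prompt slice and every atom slice.
        scores = [
            sum(q * k for q, k in zip(query, atom[lo:hi])) / math.sqrt(d_head)
            for atom in atom_reprs
        ]
        attn = softmax(scores)
        weights.append(attn)
        # Attention-weighted sum of atom features for this head's slice.
        for j in range(lo, hi):
            molecule.append(sum(a * atom[j] for a, atom in zip(attn, atom_reprs)))
    return molecule, weights

atoms = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
h_p = [2.0, 0.0, 0.0, 2.0]
mol_repr, attn_weights = prompt_guided_aggregation(atoms, h_p)
```

Because the query comes from the prompt rather than from the atoms, changing the prompt changes which atoms dominate the pooled vector. A zero prompt degenerates to plain mean pooling, and the result does not depend on atom ordering.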