Molecule Generation for Target Protein Binding with Hierarchical Consistency Diffusion Model

Guanlue Li, Chenran Jiang, Ziqi Gao, Yu Liu, Chenyang Liu, Jiean Chen, Yong Huang, Jia Li

TL;DR

AMDiff addresses the challenge of de novo ligand design conditioned on target proteins by introducing a hierarchical diffusion framework that jointly models atom-level and motif-level ligand representations. It leverages classifier-free guidance and binding-site conditioning, augmented by topological features, to generate valid, diverse, and high-affinity molecules. On the CrossDocked benchmark and the kinase targets ALK and CDK4, AMDiff demonstrates superior validity, diversity, novelty, and docking performance, while remaining robust to changes in pocket size and to protein mutations. By bridging atom-level detail with motif-level priors and enabling cross-view information exchange, AMDiff advances structure-based drug design and has the potential to speed up target-aware molecular generation in drug discovery.

Abstract

Effective generation of molecular structures, or new chemical entities, that bind to target proteins is crucial for lead identification and optimization in drug discovery. Despite advancements in atom- and motif-wise deep learning models for 3D molecular generation, current methods often struggle with validity and reliability. To address these issues, we develop the Atom-Motif Consistency Diffusion Model (AMDiff), utilizing a joint-training paradigm for multi-view learning. This model features a hierarchical diffusion architecture that integrates both atom- and motif-level views of molecules, allowing for comprehensive exploration of complementary information. By leveraging classifier-free guidance and incorporating binding site features as conditional inputs, AMDiff ensures robust molecule generation across diverse targets. Compared to existing approaches, AMDiff exhibits superior validity and novelty in generating molecules tailored to fit various protein pockets. Case studies targeting protein kinases, including Anaplastic Lymphoma Kinase (ALK) and Cyclin-dependent kinase 4 (CDK4), demonstrate the model's capability in structure-based de novo drug design. Overall, AMDiff bridges the gap between atom-view and motif-view drug discovery and speeds up the process of target-aware molecular generation.
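
The abstract names classifier-free guidance with binding-site features as the condition, but this summary does not give the guidance formula, the guidance weight, or how the condition is dropped during training. As a minimal sketch only, the block below uses the commonly cited form eps = (1 + w) * eps_cond - w * eps_uncond and a random condition drop; the function names, the weight `w`, and the drop probability `p_uncond` are all hypothetical and are not taken from AMDiff.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def maybe_drop_condition(cond: T, p_uncond: float, rng: random.Random) -> Optional[T]:
    """Training-time helper (assumption): with probability p_uncond, replace the
    pocket condition by None so the same network also learns an unconditional model."""
    return None if rng.random() < p_uncond else cond


def guided_noise(
    eps_cond: Sequence[float], eps_uncond: Sequence[float], w: float
) -> List[float]:
    """Sampling-time blend of conditional and unconditional noise predictions.

    Assumed form: eps = (1 + w) * eps_cond - w * eps_uncond.
    With w = 0 this returns the purely conditional prediction.
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical per-coordinate noise predictions for one atom.
    eps_c = [1.0, 2.0, -1.0]
    eps_u = [0.0, 1.0, 0.0]
    print(guided_noise(eps_c, eps_u, w=1.0))
    print(maybe_drop_condition("pocket_features", p_uncond=0.1, rng=rng))
```

Larger values of `w` push samples further toward the pocket-conditioned prediction, usually at some cost in diversity; how AMDiff sets this trade-off is not stated here.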

Paper Structure

This paper contains 18 sections, 16 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: (a) Ideal outputs and disadvantages of atom-based and motif-based methods for structure-based drug design. In atom-based methods (left), individual atoms serve as the fundamental units for constructing highly diverse molecular structures. While these methods excel at generating variety, they often struggle to maintain coherent and realistic substructures, frequently producing bonds with incorrect lengths and angles. Atom-based approaches can also produce unstable configurations such as three-membered rings. Conversely, motif-based methods (right) use predefined building blocks drawn from a motif vocabulary derived from existing datasets and chemical knowledge. These methods are limited when desired motifs, such as 1H-pyrrolo[3,2-b]pyridine and pyrazolo[1,5-a][1,3,5]triazine, are absent from the vocabulary, which can restrict structural diversity. Conflicts may also arise when connecting different motifs, making it harder to generate cohesive structures. (b) Illustration of the hierarchical interaction information used for ligand generation in this work. A ligand is decomposed into atoms and into motifs. In the atom-view and motif-view, interaction details between the ligand and protein (red dotted lines) are gathered using dedicated message passing networks. Cross-view interactions (purple dashed lines) facilitate the exchange of clustering and positioning information between the atom and motif views. (c) The AMDiff architecture, a diffusion-based model for hierarchical molecular generation. It is centered on a diffusion model that integrates atom-view and motif-view perspectives, each crucial for molecular generation. The model employs a conditional diffusion approach to recover noisy molecular structures and generate new ones through interactive denoising. In the atom-view, the model predicts atom types and positions, while in the motif-view, it constructs motif trees and generates predictions based on them. This design fosters effective information exchange between views and provides insights across the granularity levels of molecular structures.
  • Figure 2: Quantitative evaluation of the models on the CrossDocked (Francoeur et al., 2020) test set. (a-d) Distributions of the following metrics: (a) Docking score; (b) Molecular weight; (c) QED; (d) SA, comparing molecules from AMDiff (purple), Pocket2Mol (blue), FLAG (green), the training set (red), and the test set (yellow). (e) The 3D shape distribution of the generated molecules, visualized using NPR descriptors.
  • Figure 3: Quantitative evaluation of the models targeting ALK (PDB ID: 3LCS). (a-d) Distributions of the following metrics: (a) Docking score; (b) Molecular weight; (c) QED; (d) LogP, comparing molecules generated by AMDiff (purple), Pocket2Mol (yellow), and FLAG (green) with bioactive ligands (red). (e) The distribution of docking score, QED, and SA score for the generated samples. The drug-like region, with QED $\geq$ 0.65 and docking score $\leq$ -8.5 kcal/mol, is indicated with red boxes. (f) RMSD between conformations before and after docking, used to quantify the conformational change introduced by the docking process.
  • Figure 4: Examples of molecules generated by AMDiff targeting CDK4 (PDB ID: 7SJ3). (a) An example of a conditional design trajectory. At the initial time steps, substructures progressively explore interactions with the pocket in both the atom-view and the motif-view. The trajectory gradually refines into a realistic molecular structure. (b) Molecules designed to target CDK4 (PDB ID: 7SJ3), shown with molecular properties such as QED and SA score, binding affinity, and protein-ligand interaction analysis.
  • Figure 5: (a) The distribution of molecules generated after mutating ALK (PDB ID: 3LCS). The clustering of USRCAT fingerprints for molecules targeting three mutations is visualized in two dimensions using t-SNE. $\operatorname{ALK^{WT}}$: Wild-type ALK protein from the Protein Data Bank (PDB ID: 3LCS). $\operatorname{AF-ALK^{WT}}$: Wild-type ALK protein from AlphaFold (PDB ID: 3LCS). $\operatorname{AF-ALK^{G1202R}}$: A substitution of the amino acid Gly with Arg at position 1202 in the protein sequence. $\operatorname{AF-ALK^{S1206Y}}$: A substitution of the amino acid Ser with Tyr at position 1206 in the protein sequence. (b) Examples of molecules generated after modifying residues within the pocket of $\operatorname{AF-ALK^{G1202R}}$, $\operatorname{AF-ALK^{S1206Y}}$ and $\operatorname{AF-ALK^{WT}}$. The mutated (c) Conditional generation of molecules for various pocket sizes targeting ALK (PDB ID: 3LCS). Comparison of key properties, including docking score, molecular weight, SA, and QED, when using binding pockets of varying sizes. (d) Example generated samples adjusted to match different pocket sizes. The molecular volumes are tailored to correspond with the given pocket volumes.
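
Two quantities from the Figure 3 caption are simple enough to sketch in code: the drug-like region in panel (e), defined by QED ≥ 0.65 and docking score ≤ -8.5 kcal/mol, and the RMSD between pre- and post-docking conformations in panel (f). The thresholds come from the caption; everything else is an assumption made for illustration, including the function names, the choice to compute RMSD without any superposition (both poses taken in the same pocket frame), and the requirement that atoms appear in the same order in both conformations.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]

# Thresholds quoted in the Figure 3 caption.
QED_MIN = 0.65
DOCKING_MAX = -8.5  # kcal/mol; lower (more negative) is better


def in_drug_like_region(qed: float, docking_score: float) -> bool:
    """True if a molecule falls inside the boxed region of Figure 3(e)."""
    return qed >= QED_MIN and docking_score <= DOCKING_MAX


def rmsd(before: Sequence[Coord], after: Sequence[Coord]) -> float:
    """Root-mean-square deviation between two conformations.

    Assumptions: identical atom ordering, coordinates in the same frame,
    and no alignment step before the comparison.
    """
    if len(before) != len(after) or len(before) == 0:
        raise ValueError("conformations must be non-empty and equally sized")
    squared = sum(
        (a - b) ** 2 for p, q in zip(before, after) for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(before))


if __name__ == "__main__":
    # Hypothetical (QED, docking score) pairs, not results from the paper.
    samples = [(0.70, -9.1), (0.72, -7.0), (0.40, -9.5)]
    print([in_drug_like_region(q, d) for q, d in samples])

    generated = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
    docked = [(0.0, 0.1, 0.0), (1.5, -0.1, 0.0)]
    print(round(rmsd(generated, docked), 3))
```

A small RMSD under these assumptions would indicate that the generated pose already sits close to the docked pose; whether the paper counts hydrogens or aligns the poses first is not stated in the caption.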