MolChord: Structure-Sequence Alignment for Protein-Guided Drug Design

Wei Zhang, Zekun Guo, Yingce Xia, Peiran Jin, Shufang Xie, Tao Qin, Xiang-Yang Li

TL;DR

MolChord combines two techniques: it aligns protein and molecule structures with their textual descriptions and sequential representations, and it guides generated molecules toward desired properties by curating preference data and refining the alignment with Direct Preference Optimization (DPO).

Abstract

Structure-based drug design (SBDD), which maps target proteins to candidate molecular ligands, is a fundamental task in drug discovery. Effectively aligning protein structural representations with molecular representations, and ensuring alignment between generated drugs and their pharmacological properties, remain critical challenges. To address these challenges, we propose MolChord, which integrates two key techniques: (1) to align protein and molecule structures with their textual descriptions and sequential representations (e.g., FASTA for proteins and SMILES for molecules), we leverage NatureLM, an autoregressive model unifying text, small molecules, and proteins, as the molecule generator, alongside a diffusion-based structure encoder; and (2) to guide molecules toward desired properties, we curate a property-aware dataset by integrating preference data and refine the alignment process using Direct Preference Optimization (DPO). Experimental results on CrossDocked2020 demonstrate that our approach achieves state-of-the-art performance on key evaluation metrics, highlighting its potential as a practical tool for SBDD.
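
For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-pair DPO loss, not the authors' implementation. It assumes each preference pair consists of a preferred ("chosen") and a dispreferred ("rejected") molecule for the same pocket, and that summed sequence log-probabilities under the trained policy and a frozen reference model are already available. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Each argument is a summed log-probability of a full sequence
    (e.g. a SMILES string) under the policy or the frozen reference model.
    beta=0.1 is an assumed value, not one taken from the paper.
    """
    # Log-ratio of policy to reference for each molecule in the pair.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected

    # Scaled margin: positive when the policy prefers the chosen molecule
    # more strongly than the reference model does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # Loss is -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree on both molecules, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen molecule relative to the reference.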

Paper Structure

This paper contains 60 sections, 8 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: Overview of MolChord. For each input, unmarked text tokens are embedded by the language model, while color-marked entities ($\langle\rm 3d\; molecule\rangle$, $\langle\rm 3d\; protein\rangle$, or $\langle\rm 3d\; complex\rangle$) are processed by the Encoder. In Stage B, protein–ligand complexes are further processed through a VAE to perturb protein features, and only pocket features are injected into the language model. The bottom panel illustrates Stage C, where Direct Preference Optimization (DPO) is applied.
  • Figure 2: Visualizations of reference molecules and ligands generated by MolChord, MolChord-RL, and MolChord-RL$^{\rm dock}$ for protein pocket 1gg5. Vina score and SA are reported.
  • Figure 3: Barplot of the number of fused rings in top-ranked compounds generated by representative methods. For each method, statistics of 1,000 compounds (100 targets × 10 compounds with the highest docking scores) are reported.
  • Figure 4: OOD generalization: average Vina Dock scores on homologous vs non-homologous proteins for representative methods.
  • Figure 5: Distribution of candidate ligands per target in the CrossDocked2020 dataset. Targets are sorted by ligand count, with a dashed line marking the partition at 2 ligands, where the red and blue regions correspond to $\mathcal{D}_{\rm B}$ and $\mathcal{D}_{\rm C}$, respectively.
  • ...and 1 more figure