SiDGen: Structure-informed Diffusion for Generative modeling of Ligands for Proteins

Samyak Sanghvi, Nishant Ranjan, Tarak Karmakar

TL;DR

SiDGen addresses the challenge of structure-aware ligand generation under computational constraints by introducing a diffusion-based framework conditioned on protein pockets. It integrates masked SMILES generation with lightweight folding-derived structural features and supports dual conditioning pathways, including a coarse-stride folding mechanism that reduces memory from $O(L^2)$ to $O((L/s)^2)$. The approach is complemented by training enhancements such as curriculum learning and validity penalties, yielding high validity and novelty while maintaining competitive docking and binding-affinity performance. Overall, SiDGen demonstrates scalable, pocket-aware molecular design suitable for high-throughput drug discovery, with clear strengths and identifiable areas for further refinement.
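To make the memory claim concrete, the following minimal sketch counts pair-tensor entries at full resolution and at a coarse stride. The function name `pair_tensor_elements`, the example length of 512, and the stride of 4 are illustrative assumptions, not values taken from the paper; the sketch only instantiates the $O(L^2)$ versus $O((L/s)^2)$ scaling stated above.

```python
def pair_tensor_elements(length: int, stride: int = 1, channels: int = 1) -> int:
    """Count entries in a pair tensor built over every `stride`-th position.

    `length` is the protein sequence length L and `stride` is s.
    Both the name and the signature are hypothetical, for illustration only.
    """
    coarse_len = -(-length // stride)  # ceil(L / s) using integer arithmetic
    return coarse_len * coarse_len * channels


# Illustrative numbers (assumed, not from the paper): L = 512, s = 4.
full = pair_tensor_elements(512)
coarse = pair_tensor_elements(512, stride=4)
reduction = full / coarse  # equals s**2 when s divides L
print(full, coarse, reduction)
```

When the stride divides the sequence length, the saving is a factor of $s^2$, so doubling the stride cuts pair-tensor memory by roughly four.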

Abstract

Designing ligands that are both chemically valid and structurally compatible with protein binding pockets is a key bottleneck in computational drug discovery. Existing approaches either ignore structural context or rely on expensive, memory-intensive encoding that limits throughput and scalability. We present SiDGen (Structure-informed Diffusion Generator), a protein-conditioned diffusion framework that integrates masked SMILES generation with lightweight folding-derived features for pocket awareness. To balance expressivity with efficiency, SiDGen supports two conditioning pathways: a streamlined mode that pools coarse structural signals from protein embeddings and a full mode that injects localized pairwise biases for stronger coupling. A coarse-stride folding mechanism with nearest-neighbor upsampling alleviates the quadratic memory costs of pair tensors, enabling training on realistic sequence lengths. Learning stability is maintained through in-loop chemical validity checks and an invalidity penalty, while large-scale training efficiency is restored via selective compilation, dataloader tuning, and gradient accumulation. In automated benchmarks, SiDGen generates ligands with high validity, uniqueness, and novelty, while achieving competitive performance in docking-based evaluations and maintaining reasonable molecular properties. These results demonstrate that SiDGen can deliver scalable, pocket-aware molecular design, providing a practical route to conditional generation for high-throughput drug discovery.
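The nearest-neighbor upsampling step mentioned in the abstract can be pictured with a small sketch. The abstract does not specify the index mapping, so the floor-division mapping used here, the function name `upsample_nearest`, and the toy 2 x 2 input are all assumptions made for illustration; real pair tensors would also carry a channel dimension and live in a tensor library rather than nested lists.

```python
from typing import List

Matrix = List[List[float]]


def upsample_nearest(coarse: Matrix, stride: int, length: int) -> Matrix:
    """Expand a coarse pair map to length x length by repeating entries.

    Position (i, j) in the full map reads the coarse cell
    (i // stride, j // stride). This mapping is an assumption, not the
    paper's documented scheme.
    """
    return [
        [coarse[i // stride][j // stride] for j in range(length)]
        for i in range(length)
    ]


# Toy coarse map for a length-4 sequence folded at stride 2 (made-up values).
coarse_map = [[0.0, 1.0], [2.0, 3.0]]
full_map = upsample_nearest(coarse_map, stride=2, length=4)
for row in full_map:
    print(row)
```

Each coarse cell is copied into a `stride` x `stride` block, so only the coarse map needs to be computed at pair resolution before it is expanded back to the full sequence length.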

Paper Structure

This paper contains 26 sections, 18 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of the SiDGen architecture.
  • Figure 2: Properties of generated molecules.