A Survey of Generative AI for de novo Drug Design: New Frontiers in Molecule and Protein Generation

Xiangru Tang, Howard Dai, Elizabeth Knight, Fang Wu, Yunyang Li, Tianxiao Li, Mark Gerstein

TL;DR

The survey catalogues how generative AI accelerates de novo drug design by organizing methods into small-molecule and protein design, detailing representative architectures (VAEs, GANs, flows, diffusion) and their integration with graph and geometric representations. It systematically maps task definitions, datasets, metrics, and state-of-the-art models across target-agnostic and target-aware molecular design, as well as comprehensive protein design tasks including representation learning, structure and sequence generation, and antibody-focused subfields. The work highlights diffusion models and equivariant graph networks as dominant recent trends, and emphasizes practical challenges such as benchmarking standardization, validation breadth, and explainability. It also points to future directions, including more realistic evaluation pipelines, richer multimodal representations, and closer alignment with experimental validation to enable reliable, scalable drug discovery. The accompanying repository complements the survey by offering organized access to cited sources and datasets to foster collaboration in this rapidly evolving field.

Abstract

Artificial intelligence (AI)-driven methods can vastly improve the historically costly drug design process, with various generative models already in widespread use. Generative models for de novo drug design, in particular, focus on the creation of novel biological compounds entirely from scratch, representing a promising future direction. Rapid development in the field, combined with the inherent complexity of the drug design process, creates a difficult landscape for new researchers to enter. In this survey, we organize de novo drug design into two overarching themes: small molecule and protein generation. Within each theme, we identify a variety of subtasks and applications, highlighting important datasets, benchmarks, and model architectures and comparing the performance of top models. We take a broad approach to AI-driven drug design, allowing for both micro-level comparisons of various methods within each subtask and macro-level observations across different fields. We discuss parallel challenges and approaches between the two applications and highlight future directions for AI-driven de novo drug design as a whole. An organized repository of all covered sources is available at https://github.com/gersteinlab/GenAI4Drug.

Paper Structure (95 sections, 16 equations, 5 figures, 12 tables)

Figures (5)

  • Figure 1: An overview of the topics covered in this survey. In particular, we explore the intersection between generative AI model architectures and real-world applications, organized into two main categories: small molecule and protein generation tasks. Note that diffusion and flow-based models are often paired with GNNs for processing 2D/3D-based input, while VAEs and GANs are typically used for 1D input [dong2020molecular, Weng_2021, ganapathy2013crystal, Silva_2021, ProteinEmbl-Ebi, zhao2018silico].
  • Figure 2: A structured layout for all terms and papers covered in our survey, including datasets, models, and metrics for each task. Sections contained in the main text are highlighted in blue, while sections expanded upon in the appendix are highlighted in purple.
  • Figure 3: An overview of the progress in target-agnostic molecule design over time. Shortcomings of previous models are shown in the corresponding pink boxes, with subsequent models solving these shortcomings through novel design choices [gomez2018automatic, hoogeboom2022equivariant, jin2018junction, xu2023geometric, huang2022mdm, huang2023learning].
  • Figure 4: An overview of the progress in protein generation over time. Shortcomings of previous models are shown in the corresponding pink boxes, with subsequent models solving these shortcomings through novel design choices [sevgen2023prot, trippe2022diffusion, wu2022protein, LatentDiff, yim2023se, shi2022protein]. For consistency, only methods that generate proteins from scratch (without fixed backbone or sequence input) are depicted.
  • Figure 5: A comprehensive overview of the antibody generation pipeline for CDR-H3 design [Silva_2021, zhao2018silico, bocharov2018basic, stanzionechapter, ambrosetti2020protocol]. The inputs are a target antigen and antibody information (without CDR-H3), and the output is an antibody-antigen complex with a designed CDR-H3 sequence. Note that while most antibody CDR-H3 generation methods generate only the CDR-H3 region and need a docked structure as input, some methods, such as DockGPT [mcpartlon2023deep], HERN [jin2022antibody], and dyMEAN [kong2023end], perform multiple steps of the pipeline on their own.