A Comprehensive Review of Emerging Approaches in Machine Learning for De Novo PROTAC Design

Yossra Gharbi, Rocío Mercado

TL;DR

The paper surveys machine learning approaches for de novo PROTAC design, focusing first on the specialized challenges of PROTAC linker design and then on holistic PROTAC design that optimizes warhead, E3 ligase ligand, and linker. It reviews 2D and 3D generative models, reinforcement learning, and degradation-activity surrogates, highlighting key datasets like PROTAC-DB and PROTACpedia and noting the limitations imposed by data scarcity and reliance on small-molecule training. The authors underscore the critical role of 3D information and ternary-complex modeling in PROTAC design, discuss current limitations of existing ML tools when applied to this modality, and point to emerging directions such as diffusion models and transfer learning. The work provides a roadmap for future ML-driven PROTAC engineering, emphasizing tailored datasets, physics-informed modeling, and methods capable of capturing the spatial dynamics essential for effective targeted protein degradation.

Abstract

Targeted protein degradation (TPD) is a rapidly growing field in modern drug discovery that aims to regulate the intracellular levels of proteins by harnessing the cell's innate degradation pathways to selectively target and degrade disease-related proteins. This strategy creates new opportunities for therapeutic intervention in cases where occupancy-based inhibitors have not been successful. Proteolysis-targeting chimeras (PROTACs) are at the heart of TPD strategies, which leverage the ubiquitin-proteasome system for the selective targeting and proteasomal degradation of pathogenic proteins. As the field evolves, it becomes increasingly apparent that the traditional methodologies for designing such complex molecules have limitations. This has led to the use of machine learning (ML) and generative modeling to improve and accelerate the development process. In this review, we explore the impact of ML on de novo PROTAC design, an aspect of molecular design that has not been comprehensively reviewed despite its significance. We delve into the distinct characteristics of PROTAC linker design, underscoring the complexities required to create effective bifunctional molecules capable of TPD. We then examine how ML in the context of fragment-based drug design (FBDD), honed in the realm of small-molecule drug discovery, is paving the way for PROTAC linker design. Our review provides a critical evaluation of the limitations inherent in applying this method to the complex field of PROTAC development. Moreover, we review existing ML works applied to PROTAC design, highlighting pioneering efforts and, importantly, the limitations these studies face. By offering insights into the current state of PROTAC development and the integral role of ML in PROTAC design, we aim to provide valuable perspectives for researchers in their pursuit of better design strategies for this new modality.

Paper Structure

This paper contains 14 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: (a) A PROTAC is a hetero-bifunctional molecule, consisting of a ligand (blue triangle) that recruits an E3 ubiquitin ligase, a warhead (orange circle) that binds to the protein of interest (POI), and a linker (blue curve) that connects the two binding moieties. The PROTAC functions by simultaneously binding to the POI and the E3 ligase, thus bringing them into close proximity and inducing the formation of a ternary complex. (b) The PROTAC mechanism of action (MoA) begins with an E1 ubiquitin-activating enzyme that activates ubiquitin (Ub) in an ATP-dependent manner. This activated Ub is then transferred to an E2 Ub-conjugating enzyme. Subsequently, a PROTAC simultaneously binds to the POI and an E3 ubiquitin ligase, bringing them into close proximity. This facilitates the transfer of Ub from the E2 enzyme to the POI, catalyzed by the E3 ligase. The polyubiquitinated POI is then recognized and degraded by the proteasome into smaller peptides, and the PROTAC is released back into the cellular environment where it can be reused, initiating the process again with another instance of the same POI. (c) Visual representations of dBET6 and its respective ternary complex: left -- a 2D skeletal formula of the PROTAC molecule dBET6; middle -- a close-up of the dBET6 degrader's three-dimensional (3D) structure in complex with CRBN and BRD4 (PDB ID: 6BOY), emphasizing the importance of the PROTAC's spatial orientation in forming a good ternary complex; and right -- a space-filling model for the same complex, involving BRD4, CRBN, DNA damage-binding protein 1 (DDB1), and dBET6. Color key: BRD4 (green), CRBN (cyan), and DDB1 (dark blue).
  • Figure 2: (a) An overview of fragment-based drug design (FBDD). The initial step involves fragment screening to identify potential fragments that can bind to the pocket of the target protein. These fragments are then linked and optimized to improve their binding properties. The result is a strongly bound ligand that fits precisely within the target protein's pocket. (b) left -- The linker in a PROTAC is not just a passive bridge. It is an important component that enhances the interaction dynamics between the POI and the E3 ligase. right -- The linker also contributes to the PROTAC's overall pharmacokinetic (PK) profile, including cell permeability. center -- Because its MoA relies on transient ternary complex formation, the PROTAC is eventually released, meaning it is catalytic and can go on to be reused for other processes inside the cell. (c) The large and multivalent nature of PROTACs means they require a more complex design approach than FBDD methods developed for small molecules. The linker must be long and/or flexible enough to allow the warhead and E3 ligase ligand to adopt the necessary conformations for effective ternary complex formation, but not so flexible that the PROTAC cannot maintain the correct spatial orientation of the warhead and E3 ligase ligand. The linker may also need to incorporate specific chemical groups to enhance the overall potency of the PROTAC.
  • Figure 3: The distributions of various molecular descriptors in PROTACs versus small molecules. PROTACs were downloaded from PROTAC-DB and PROTACpedia, while small molecules were randomly sampled from ZINC-250k (Irwin et al., 2020), a popular database used in drug discovery containing commercially available compounds for virtual screening (e.g., drug-like compounds). This comparative analysis of their chemical and physical properties highlights the differences between the two classes of molecules. The descriptors include molecular weight, partition coefficient (LogP), number of rotatable bonds, number of hydrogen bond donors (HBDs) and acceptors (HBAs), and normalized atom counts for carbon.
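
Two of the descriptors named in the Figure 3 caption, molecular weight and the normalized carbon count, can be computed from a molecular formula alone. The following is a minimal sketch, not the authors' pipeline. The helper names (`parse_formula`, `molecular_weight`, `carbon_fraction`) are hypothetical, the normalization (carbon atoms divided by heavy atoms) is an assumption because the caption does not define it, and the formula C9H8O4 (aspirin) is an arbitrary small-molecule example that is not taken from the paper's datasets. LogP, rotatable bonds, HBDs, and HBAs depend on the full molecular structure and would need a cheminformatics toolkit, so they are left out here.

```python
import re

# Standard atomic masses (g/mol) for a few elements common in drug-like molecules.
ATOMIC_MASS = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
    "F": 18.998, "S": 32.06, "Cl": 35.45,
}


def parse_formula(formula: str) -> dict:
    """Count atoms in a flat molecular formula such as 'C9H8O4' (no brackets)."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def molecular_weight(counts: dict) -> float:
    """Sum atomic masses weighted by atom counts."""
    return sum(ATOMIC_MASS[element] * n for element, n in counts.items())


def carbon_fraction(counts: dict) -> float:
    """Carbon atoms divided by heavy (non-hydrogen) atoms.

    Assumption: this is one plausible reading of 'normalized atom counts
    for carbon'; the caption does not state the normalization used.
    """
    heavy = sum(n for element, n in counts.items() if element != "H")
    return counts.get("C", 0) / heavy if heavy else 0.0


if __name__ == "__main__":
    counts = parse_formula("C9H8O4")  # illustrative small molecule only
    print(f"MW = {molecular_weight(counts):.3f} g/mol")
    print(f"carbon fraction = {carbon_fraction(counts):.3f}")
```

Applying the same functions to a set of PROTAC formulas and a set of small-molecule formulas would give the per-molecule values that a comparison like Figure 3 plots as distributions.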