Beam Enumeration: Probabilistic Explainability For Sample Efficient Self-conditioned Molecular Design

Jeff Guo, Philippe Schwaller

TL;DR

The paper tackles explainability and sample efficiency in language-based molecular design by introducing Beam Enumeration, which exhaustively enumerates high-probability token sub-sequences to extract meaningful molecular substructures. These substructures enable self-conditioned generation and provide a probabilistic form of explainability. When combined with Augmented Memory (or REINVENT), Beam Enumeration substantially improves sample efficiency, reducing expensive oracle calls while increasing the yield of high-reward molecules. Across illustrative experiments and three docking-focused drug-discovery case studies, the approach yields more high-reward molecules under the same oracle budget, often within a few thousand calls, and shows that explainability and sample efficiency can be improved together rather than traded off. Overall, Beam Enumeration is presented as a task-agnostic method that can enhance existing generative design pipelines and potentially enable direct optimization of expensive physics-based oracles.

Abstract

Generative molecular design has moved from proof-of-concept to real-world applicability, as marked by the surge in very recent papers reporting experimental validation. Key challenges in explainability and sample efficiency present opportunities to enhance generative design to directly optimize expensive high-fidelity oracles and provide actionable insights to domain experts. Here, we propose Beam Enumeration to exhaustively enumerate the most probable sub-sequences from language-based molecular generative models and show that molecular substructures can be extracted. When coupled with reinforcement learning, extracted substructures become meaningful, providing a source of explainability and improving sample efficiency through self-conditioned generation. Beam Enumeration is generally applicable to any language-based molecular generative model and notably further improves the performance of the recently reported Augmented Memory algorithm, which achieved the new state-of-the-art on the Practical Molecular Optimization benchmark for sample efficiency. Given a fixed oracle budget, the combined algorithm generates more high-reward molecules and does so faster. Beam Enumeration shows that improvements to explainability and sample efficiency for molecular design can be made synergistic.
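
The enumeration step can be illustrated with a minimal Python sketch. It follows the description in the Figure 1 caption (take the top $k$ tokens by probability at each of $N$ beam steps and keep every resulting sub-sequence), but it is not the authors' implementation. The function `toy_next_token_probs`, its probabilities, and the parameter names `k` and `n_steps` are assumptions made for illustration; a real setup would query a trained language-based molecular generative model instead.

```python
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[str, ...]


def toy_next_token_probs(prefix: Prefix) -> Dict[str, float]:
    """Hypothetical stand-in for a trained autoregressive model.

    Returns a next-token distribution that depends only on the last token.
    """
    if prefix and prefix[-1] == "C":
        return {"C": 0.5, "O": 0.3, "N": 0.15, "(": 0.05}
    return {"C": 0.6, "N": 0.25, "O": 0.1, "(": 0.05}


def beam_enumerate(
    next_token_probs: Callable[[Prefix], Dict[str, float]],
    k: int,
    n_steps: int,
) -> List[Tuple[Prefix, float]]:
    """Expand every sub-sequence with its top-k tokens for n_steps steps.

    No beams are pruned, so the result holds k ** n_steps sub-sequences,
    each paired with its joint probability under the model.
    """
    beams: List[Tuple[Prefix, float]] = [((), 1.0)]
    for _ in range(n_steps):
        expanded: List[Tuple[Prefix, float]] = []
        for prefix, prob in beams:
            probs = next_token_probs(prefix)
            # Keep only the k most probable next tokens for this prefix.
            top_k = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
            for token, p in top_k:
                expanded.append((prefix + (token,), prob * p))
        beams = expanded
    return beams


subsequences = beam_enumerate(toy_next_token_probs, k=2, n_steps=3)
for tokens, prob in sorted(subsequences, key=lambda b: b[1], reverse=True):
    print("".join(tokens), round(prob, 4))
```

Because nothing is pruned, the number of sub-sequences grows as $k^N$, so both values have to stay small in practice. Extracting chemically valid substructures from the enumerated sub-sequences is a separate step that would require a cheminformatics toolkit and is not shown here.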

Paper Structure

This paper contains 41 sections, 5 equations, 15 figures, 13 tables, and 2 algorithms.

Figures (15)

  • Figure 1: Beam Enumeration overview. a. The proposed method proceeds via 4 steps: 1. generate a batch of molecules; 2. filter molecules based on the pool to enforce substructure presence, discarding the rest; 3. compute the reward; 4. update the model. After updating the model, if the reward has improved for consecutive epochs, execute Beam Enumeration. b. Beam Enumeration sequentially enumerates the top $k$ tokens by probability for $N$ beam steps, resulting in an exhaustive set of token sub-sequences. c. All valid substructures (either by the Structure or Scaffold criterion) are extracted from the sub-sequences. The most frequent substructures are used for self-conditioned generation.
  • Figure 2: Illustrative experiment with the following multi-parameter optimization objective: maximize tPSA, molecular weight < 350 Da, number of rings $\geq$ 2. a. Augmented Memory (guo2023augmented) reward trajectory, annotated with the top-4 most frequent molecular substructure scaffolds (excluding benzene) at varying epochs using Beam Enumeration. b. Examples of molecules with high reward.
  • Figure 3: Three drug discovery case studies showing the top generated molecule (triplicate experiments) using Augmented Memory (guo2023augmented) with Beam Enumeration (Structure criterion, Minimum Structure Size = 15) and the reference ligand. Extracted substructures from Beam Enumeration are highlighted. The multi-parameter optimization objective is: minimize Vina score, maximize QED, and molecular weight < 500 Da. The values, together with the Synthetic Accessibility (SA) score (ertl2009estimation), are annotated. a. Dopamine type 2 receptor (wang2018structure). b. MK2 kinase (argiriadi20102). c. Acetylcholinesterase (kryger1999structure).
  • Figure B4: Beam Enumeration overview. a. The proposed method proceeds via 4 steps: 1. generate a batch of molecules; 2. filter molecules based on the pool to enforce substructure presence, discarding the rest; 3. compute the reward; 4. update the model. After updating the model, if the reward has improved for consecutive epochs, execute Beam Enumeration. b. Beam Enumeration sequentially enumerates the top $k$ tokens by probability for $N$ beam steps, resulting in an exhaustive set of token sub-sequences. c. All valid substructures (either by the Structure or Scaffold criterion) are extracted from the sub-sequences. The most frequent substructures are used for self-conditioned generation. This overview figure is the same as in the main text.
  • Figure C5: Illustrative experiment, Generative Yield > 0.8. The IntDiv1 (polykovskiy2020molecular) is annotated.
  • ...and 10 more figures
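
The self-conditioning loop described in the Figure 1 caption (keep the most frequent extracted substructures as a pool, filter each generated batch against that pool, and trigger Beam Enumeration after consecutive reward improvements) can be sketched as follows. This is a hedged illustration only. Plain substring matching on SMILES strings stands in for real substructure matching, which would need a cheminformatics toolkit. The helper names, the `patience` parameter, and the rule that a molecule passes if it contains any one pool substructure are all assumptions not confirmed by this section.

```python
from collections import Counter
from typing import List, Sequence


def most_frequent(substructures: Sequence[str], top_n: int) -> List[str]:
    """Return the top_n most frequently extracted substructures (the pool)."""
    return [s for s, _ in Counter(substructures).most_common(top_n)]


def filter_by_pool(batch: Sequence[str], pool: Sequence[str]) -> List[str]:
    """Keep molecules containing at least one pool substructure.

    Assumption: substring matching is a crude proxy for substructure
    matching, and "any one substructure" is an illustrative choice.
    """
    if not pool:
        return list(batch)  # nothing to enforce before the first enumeration
    return [smi for smi in batch if any(sub in smi for sub in pool)]


def improved_for_consecutive_epochs(rewards: Sequence[float], patience: int) -> bool:
    """True if the reward strictly increased over the last `patience` epochs."""
    if len(rewards) < patience + 1:
        return False
    recent = rewards[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Toy data: substructures extracted from enumerated sub-sequences.
extracted = ["c1ccccc1", "C(=O)N", "C(=O)N", "c1ccncc1", "C(=O)N", "c1ccncc1"]
pool = most_frequent(extracted, top_n=2)

batch = ["CC(=O)Nc1ccccc1", "CCO", "Cc1ccncc1"]
kept = filter_by_pool(batch, pool)

print(pool)
print(kept)
print(improved_for_consecutive_epochs([0.1, 0.2, 0.3, 0.4], patience=3))
```

Only the molecules in `kept` would go on to reward computation, which is where the savings in oracle calls would come from under this reading of the caption.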