Molecule Generation with Fragment Retrieval Augmentation

Seul Lee, Karsten Kreis, Srimukh Prasad Veccham, Meng Liu, Danny Reidenbach, Saee Paliwal, Arash Vahdat, Weili Nie

TL;DR

Fragment Retrieval-Augmented Generation (f-RAG) is a new fragment-based molecule generation framework with retrieval augmentation. It builds on a pre-trained molecular generative model that, given input fragments, proposes additional fragments to complete a new molecule.

Abstract

Fragment-based drug discovery, in which molecular fragments are assembled into new molecules with desirable biochemical properties, has achieved great success. However, many fragment-based molecule generation methods show limited exploration beyond the existing fragments in the database as they only reassemble or slightly modify the given ones. To tackle this problem, we propose a new fragment-based molecule generation framework with retrieval augmentation, namely Fragment Retrieval-Augmented Generation (f-RAG). f-RAG is based on a pre-trained molecular generative model that proposes additional fragments from input fragments to complete and generate a new molecule. Given a fragment vocabulary, f-RAG retrieves two types of fragments: (1) hard fragments, which serve as building blocks that will be explicitly included in the newly generated molecule, and (2) soft fragments, which serve as reference to guide the generation of new fragments through a trainable fragment injection module. To extrapolate beyond the existing fragments, f-RAG updates the fragment vocabulary with generated fragments via an iterative refinement process which is further enhanced with post-hoc genetic fragment modification. f-RAG can achieve an improved exploration-exploitation trade-off by maintaining a pool of fragments and expanding it with novel and high-quality fragments through a strong generative prior.
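
The iterative refinement described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: fragments are plain strings, retrieval is uniform random sampling, and `toy_generate`, `toy_score`, and `toy_decompose` are hypothetical stand-ins for the pre-trained generative model, the property oracle, and the fragment decomposition. The genetic fragment modification step is omitted.

```python
import heapq
import random
from typing import Callable, List, Tuple

Fragment = str
Scored = Tuple[float, Fragment]  # (score, fragment)


def update_vocabulary(vocab: List[Scored], new: List[Scored], max_size: int) -> List[Scored]:
    """Merge newly scored fragments into the pool and keep the best `max_size`."""
    best = {}
    for score, frag in list(vocab) + list(new):
        # Keep the highest score seen so far for each distinct fragment.
        best[frag] = max(score, best.get(frag, float("-inf")))
    return heapq.nlargest(max_size, ((s, f) for f, s in best.items()))


def refine(vocab: List[Scored],
           generate: Callable[[List[Fragment], List[Fragment]], List[Fragment]],
           score: Callable[[List[Fragment]], float],
           decompose: Callable[[List[Fragment]], List[Fragment]],
           rounds: int, n_hard: int = 2, n_soft: int = 3,
           max_size: int = 50, seed: int = 0) -> List[Scored]:
    """Generate molecules round by round and feed their fragments back into the pool."""
    rng = random.Random(seed)
    for _ in range(rounds):
        frags = [f for _, f in vocab]
        hard = rng.sample(frags, n_hard)                    # included explicitly
        soft = rng.sample(frags, min(n_soft, len(frags)))   # guidance only
        molecule = generate(hard, soft)
        s = score(molecule)
        # Assumption: every fragment of a molecule inherits the molecule's score.
        vocab = update_vocabulary(vocab, [(s, f) for f in decompose(molecule)], max_size)
    return vocab


# Hypothetical toy stand-ins so the sketch runs end to end.
def toy_generate(hard: List[Fragment], soft: List[Fragment]) -> List[Fragment]:
    # Keeps the hard fragments and adds one "new" fragment derived from a soft one.
    return hard + [soft[0] + "'"]


def toy_score(molecule: List[Fragment]) -> float:
    return float(sum(len(f) for f in molecule))


def toy_decompose(molecule: List[Fragment]) -> List[Fragment]:
    return list(molecule)


initial = [(1.0, "A"), (1.0, "B"), (1.0, "C")]
final = refine(initial, toy_generate, toy_score, toy_decompose, rounds=5, max_size=4)
print(final)
```

The pool is capped at `max_size`, so fragments introduced by the generator can displace lower-scoring entries from the initial vocabulary, which is the mechanism the abstract credits for exploring beyond the existing fragments.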

Paper Structure

This paper contains 44 sections, 7 equations, 8 figures, 11 tables, 1 algorithm.

Figures (8)

  • Figure 1: A radar plot of target properties. $f$-RAG strikes a better balance among optimization performance, diversity, novelty, and synthesizability than state-of-the-art techniques on the PMO benchmark (Gao et al., 2022).
  • Figure 2: The overall framework of $f$-RAG. After an initial fragment vocabulary is constructed from an existing molecule library, two types of fragments are retrieved during generation. Hard fragments are explicitly included in the newly generated molecules, while soft fragments implicitly guide the generation of new fragments. SAFE-GPT generates a molecule using hard fragments as input, while the fragment injection module in the middle of the SAFE-GPT layers injects the embeddings of soft fragments into the input embedding. After the generation, the molecule population and fragment vocabulary are updated with the newly generated molecule and its fragments, respectively. The exploration is further enhanced with genetic fragment modification, which also updates the fragment vocabulary and molecule population.
  • Figure 3: Hard fragment retrieval of $f$-RAG. With a probability of $50\%$, $f$-RAG either retrieves two arms as hard fragments for linker design (top) or one arm and one linker as hard fragments for motif extension (bottom).
  • Figure 4: The self-supervised training process of the fragment injection module of $f$-RAG. $F^{k\text{NN}}$ denotes the $k$-th most similar fragment to $F$. Using $F_1$ and $F_2$ as hard fragments and $F_3$ together with its neighbors $\{F_3^{k\text{NN}}\}_{k=2}^{K}$ as soft fragments, the training objective is to predict $F_3^{1\text{NN}}$.
  • Figure 5: (a) The optimization curves in the deco_hop task of the PMO benchmark of the ablated $f$-RAGs. Solid lines denote the mean and shaded areas denote the standard deviation of 3 independent runs. (b) Overall results of the ablated $f$-RAGs. (c) Results with different values of $\delta$ of the similarity-based fragment filter.
  • ...and 3 more figures
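
The captions of Figures 3 and 4 describe two concrete procedures: the 50/50 choice of hard fragments, and the construction of a self-supervised training example. The sketch below illustrates both under assumptions that are not taken from the paper: fragments are strings, `char_jaccard` is a hypothetical toy similarity used only for the demo, and the function names and the returned dictionary layout are made up for illustration.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple


def retrieve_hard_fragments(arms: Sequence[str], linkers: Sequence[str],
                            rng: random.Random) -> Tuple[str, List[str]]:
    """Pick hard fragments following the 50/50 split described for Figure 3."""
    if rng.random() < 0.5:
        # Linker design: two arms are given, a linker has to be generated.
        return "linker_design", rng.sample(list(arms), 2)
    # Motif extension: one arm and one linker are given, another arm is generated.
    return "motif_extension", [rng.choice(list(arms)), rng.choice(list(linkers))]


def build_training_example(f1: str, f2: str, f3: str, pool: Sequence[str],
                           similarity: Callable[[str, str], float],
                           K: int) -> Dict[str, object]:
    """Build one example as in Figure 4: predict the nearest neighbor of F3."""
    # Rank every other fragment in the pool by similarity to F3, most similar first.
    neighbors = sorted((f for f in pool if f != f3),
                       key=lambda f: similarity(f3, f), reverse=True)[:K]
    if len(neighbors) < K:
        raise ValueError("pool is too small for the requested K")
    return {
        "hard": [f1, f2],               # explicitly included fragments
        "soft": [f3] + neighbors[1:],   # F3 and its 2nd..K-th neighbors
        "target": neighbors[0],         # 1st nearest neighbor of F3
    }


def char_jaccard(a: str, b: str) -> float:
    """Toy similarity (assumption): Jaccard overlap of the character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


pool = ["CCO", "CCN", "CCOC", "c1ccccc1", "NN"]
example = build_training_example("c1ccccc1", "NN", "CCO", pool, char_jaccard, K=3)
task, hard = retrieve_hard_fragments(["a1", "a2", "a3"], ["l1", "l2"], random.Random(0))
```

In the toy example, `"CCOC"` is the closest fragment to `"CCO"` under the character-set similarity, so it becomes the prediction target, while `"CCO"` and the remaining neighbors serve as soft fragments.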