SAFE setup for generative molecular design

Yassir El Mesbahi, Emmanuel Noutahi

TL;DR

Larger, more diverse datasets improve the performance of SAFE-based generative models, and the LLaMA architecture with Rotary Positional Embedding proves the most robust. The study highlights the key factors that significantly affect the efficacy of these models.

Abstract

SMILES-based molecular generative models have been pivotal in drug design but face challenges in fragment-constrained tasks. To address this, the Sequential Attachment-based Fragment Embedding (SAFE) representation was recently introduced as an alternative that streamlines those tasks. In this study, we investigate the optimal setups for training SAFE generative models, focusing on dataset size, data augmentation through randomization, model architecture, and bond disconnection algorithms. We found that larger, more diverse datasets improve performance, with the LLaMA architecture using Rotary Positional Embedding proving most robust. SAFE-based models also consistently outperform SMILES-based approaches in scaffold decoration and linker design, particularly with BRICS decomposition yielding the best results. These insights highlight key factors that significantly impact the efficacy of SAFE-based generative models.

Paper Structure

This paper contains 25 sections, 18 figures, 2 tables.

Figures (18)

  • Figure 1: Performance of each architecture across 4 datasets and 6 representations. SMILES-based results (squares) are indicated for reference.
  • Figure 2: Performance across datasets of different sizes, measured by test loss, validity, and fragmentation percentage. RNNs' test loss results are omitted due to incomparable scale and the use of a different tokenization approach.
  • Figure 3: Effect of fragmentation algorithm on generative metrics and on the percentage of fragmented molecules, across all models, on MOSES-Full.
  • Figure 4: Effect of fragmentation algorithm on performance, across all models, when taking into account all datasets irrespective of the size.
  • Figure 5: Average performance of SAFE and SMILES sampling algorithms on scaffold decoration tasks (3 benchmarks, 29 scaffolds). SAFE models outperformed SMILES approaches. SMILES-PROMPT refers to PromptSMILES, and SMILES-SAMOA to SAMOA.
  • ...and 13 more figures