NEBULA: Neural Empirical Bayes Under Latent Representations for Efficient and Controllable Design of Molecular Libraries

Ewa M. Nowara, Pedro O. Pinheiro, Sai Pooja Mahajan, Omar Mahmood, Andrew Martin Watkins, Saeed Saremi, Michael Maser

TL;DR

NEBULA addresses the efficient generation of large molecular libraries in 3D around a seed compound. It compresses voxelized molecular densities with a VQ-VAE, trains a denoiser on the resulting latent embeddings, and samples with neural empirical Bayes (NEB) via walk-jump sampling (WJS), which allows rapid exploration around seed molecules. The method is nearly an order of magnitude faster than prior 3D generative models while preserving molecule validity and seed scaffolds, and it generalizes to unseen chemical spaces, including recently released drugs. For ML-driven drug discovery, this expands the searchable chemical space around lead compounds. The code is publicly available.

Abstract

We present NEBULA, the first latent 3D generative model for scalable generation of large molecular libraries around a seed compound of interest. Such libraries are crucial for scientific discovery, but it remains challenging to generate large numbers of high quality samples efficiently. 3D-voxel-based methods have recently shown great promise for generating high quality samples de novo from random noise (Pinheiro et al., 2023). However, sampling in 3D-voxel space is computationally expensive and use in library generation is prohibitively slow. Here, we instead perform neural empirical Bayes sampling (Saremi & Hyvarinen, 2019) in the learned latent space of a vector-quantized variational autoencoder. NEBULA generates large molecular libraries nearly an order of magnitude faster than existing methods without sacrificing sample quality. Moreover, NEBULA generalizes better to unseen drug-like molecules, as demonstrated on two public datasets and multiple recently released drugs. We expect the approach herein to be highly enabling for machine learning-based drug discovery. The code is available at https://github.com/prescient-design/nebula
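
To make the sampling idea concrete, the following is a minimal sketch of seeded walk-jump sampling in a latent space. It is not the NEBULA implementation: the learned latent U-Net denoiser is replaced by a closed-form Bayes denoiser for a toy Gaussian prior, the walk uses plain overdamped Langevin dynamics, and all names (`make_toy_denoiser`, `walk_jump`, `step_size`, the latent size of 16) are hypothetical choices made for illustration.

```python
import numpy as np


def make_toy_denoiser(mu, tau, sigma):
    """Closed-form Bayes denoiser for a toy prior x ~ N(mu, tau^2 I).

    Stand-in (assumption) for a learned latent denoiser: with Gaussian
    noise of scale sigma, the posterior mean E[x | y] is a shrinkage
    of y toward mu.
    """
    shrink = tau**2 / (tau**2 + sigma**2)

    def denoise(y):
        return mu + shrink * (y - mu)

    return denoise


def walk_jump(seed, denoise, sigma, n_steps, step_size, rng):
    """Seeded walk-jump sampling sketch.

    Walk: Langevin steps on the noisy variable y, using the score
          implied by the denoiser, (denoise(y) - y) / sigma^2.
    Jump: a single denoising step back to a clean latent estimate.
    """
    # Start the chain from a noised copy of the seed latent.
    y = seed + sigma * rng.standard_normal(seed.shape)
    for _ in range(n_steps):
        score = (denoise(y) - y) / sigma**2
        noise = rng.standard_normal(y.shape)
        y = y + step_size * score + np.sqrt(2.0 * step_size) * noise
    return denoise(y)


# Usage: draw a small "library" of latents around one seed latent.
rng = np.random.default_rng(0)
seed_latent = np.ones(16)  # hypothetical flattened latent embedding
denoiser = make_toy_denoiser(mu=np.zeros(16), tau=1.0, sigma=0.5)
library = np.stack([
    walk_jump(seed_latent, denoiser, sigma=0.5, n_steps=20,
              step_size=0.01, rng=rng)
    for _ in range(5)
])
```

In this sketch, fewer walk steps keep samples closer to the seed and more steps let them drift further. In the setting described above, each sampled latent would then be passed through the VQ-VAE decoder to recover a voxel grid.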

Paper Structure (20 sections, 8 equations, 15 figures, 2 tables)

Figures (15)

  • Figure 1: Overview of the proposed latent generative model, NEBULA. A 3D molecular graph is represented as a voxel grid and passed through a VQ-VAE encoder to obtain latent embeddings. Noise is added to the latents and removed by a latent U-Net, which is used for generative sampling. Denoised latents are passed through a VQ-VAE decoder to reconstruct the voxel grid, and the molecular graph is obtained via peak finding and sanitization.
  • Figure 2: Molecular stability and Tanimoto similarity over WJS steps for molecules generated with NEBULA and VoxMol on GEOM (top) and PCQM (middle). (bottom) Scalability of each method plotted as the amount of time needed to generate one molecule at different WJS steps.
  • Figure 3: Seeded generation on GEOM with NEBULA and VoxMol at different WJS steps with the corresponding voxels. Both methods can generate molecules close to the seed in within-dataset generation.
  • Figure 4: Seeded generation on PCQM with NEBULA and VoxMol at different WJS steps with the corresponding voxels. NEBULA is able to generate molecules that maintain the seed scaffold in all cases in cross-dataset generation, while VoxMol tends to diverge from the seed compound.
  • Figure 5: Seeded generation with NEBULA at different WJS steps on real drugs released in March 2024 (ACS).
  • ...and 10 more figures
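
Figure 2 reports Tanimoto similarity between generated molecules and the seed. As a reminder of what that metric measures, here is a minimal sketch that computes it on sets of fingerprint on-bits. The fingerprint type used in the paper is not stated in this excerpt, so the bit sets below are hypothetical, and returning 1.0 for two empty fingerprints is a convention assumed only for this example.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices.

    Defined as |A intersect B| / |A union B|. Two empty fingerprints are
    treated as identical here (an assumed convention).
    """
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 1.0
    return len(a & b) / union


# Hypothetical fingerprints: the seed and a generated molecule share
# two of six distinct on-bits, so the similarity is 2/6.
seed_bits = {1, 2, 3, 4}
generated_bits = {3, 4, 5, 6}
similarity = tanimoto(seed_bits, generated_bits)
```

A value of 1.0 means the two fingerprints are identical and 0.0 means they share no bits, so a curve that decreases with the number of WJS steps indicates samples moving away from the seed.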