Automatic chemical design using a data-driven continuous representation of molecules

Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, Alán Aspuru-Guzik

TL;DR

Addressing the challenge of exploring vast, discrete chemical space, the authors develop a data-driven continuous latent representation of molecules using a variational autoencoder trained on SMILES strings. They jointly train a property predictor with the autoencoder to organize the latent space by target properties and demonstrate that gradient-based and Gaussian-process-driven optimization in latent space can discover novel, drug-like molecules. The work shows that latent representations encode meaningful structural features, support interpolation, and reproduce property distributions from the training data, enabling efficient exploration beyond fixed libraries. Future directions include graph-based decoders and grammar-based SMILES to improve validity and synthetic feasibility while preserving the benefits of continuous design.

Abstract

We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of existing chemical structures to construct three coupled functions: an encoder, a decoder and a predictor. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to discrete molecular representations. The predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the domain of drug-like molecules and also in the set of molecules with fewer than nine heavy atoms.
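
The gradient-based search mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the quadratic `surrogate` stands in for a trained property model over latent vectors, and `target`, `lr`, and `steps` are hypothetical names and values, not taken from the paper. The point is only the update rule: move a latent vector along the gradient of a predicted property, then (in the real pipeline) decode the result.

```python
def surrogate(z, target):
    """Toy stand-in for a property model f(z): highest when z equals target.

    Assumption: a concave quadratic, not the model used in the paper.
    """
    return -sum((a - b) ** 2 for a, b in zip(z, target))


def surrogate_grad(z, target):
    """Analytic gradient of the toy surrogate with respect to z."""
    return [-2.0 * (a - b) for a, b in zip(z, target)]


def gradient_ascent(z0, target, lr=0.1, steps=50):
    """Follow the gradient of the surrogate uphill from the starting point z0."""
    z = list(z0)
    for _ in range(steps):
        grad = surrogate_grad(z, target)
        # Ascent step: move in the direction that increases f(z).
        z = [a + lr * g for a, g in zip(z, grad)]
    return z
```

Calling `gradient_ascent([2.0, -1.0, 0.5], [0.0, 0.0, 0.0])` moves the starting vector toward the maximizer of the toy surrogate. With a learned model, the gradient would typically come from automatic differentiation rather than a closed form.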

Paper Structure

This paper contains 11 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: (a) A diagram of the proposed autoencoder for molecular design, including the joint property prediction model. Starting from a discrete molecular representation, such as a SMILES string, the encoder network converts each molecule into a vector in the latent space, which is effectively a continuous molecular representation. Given a point in the latent space, the decoder network produces a corresponding SMILES string. Another network estimates the value of target properties associated with each molecule. (b) Gradient-based optimization in continuous latent space. After training a surrogate model $f(z)$ to predict the properties of molecules based on their latent representation $z$, we can optimize $f(z)$ with respect to $z$ to find new latent representations expected to have high values of desired properties. These new latent representations can then be decoded into SMILES strings, at which point their properties can be tested empirically.
  • Figure 2: Representations of the sampling results from the variational autoencoder. (a) Kernel Density Estimation (KDE) of each latent dimension of the autoencoder; (b) Histogram of sampled molecules for a single point in the latent space; the distances of the molecules from the original query are shown by the lines corresponding to the right axis; (c) Molecules sampled near the location of ibuprofen in latent space. The values below the molecules are the distance in latent space from the decoded molecule to ibuprofen; (d) Spherical linear interpolation (slerp) between two molecules in latent space using 6 steps of equal distance.
  • Figure 3: Two-dimensional PCA of the latent space of the variational autoencoder. The two axes are the principal components selected from the PCA; the color bar shows the value of the selected property. The first column shows the representation of all molecules from the listed dataset using autoencoders trained without joint property prediction. The second column shows the representation of molecules using an autoencoder trained with joint property prediction. The third column shows a representation of random points in the latent space of the autoencoder trained with joint property prediction; the property values for these points are predicted using the property predictor network. The first three rows show the results of training on molecules from the ZINC dataset for the logP, QED, and SAS properties; the last two rows show the results of training on the QM9 dataset for the LUMO energy and the electronic spatial extent (R$^2$).
  • Figure 4: Optimization results for the jointly trained autoencoder using $5 \times$QED$-$SAS as the objective function. Part (a) shows a box plot that compares the distribution of sampled molecules from normal random sampling, from SMILES optimization via a common chemical transformation with a genetic algorithm, and from optimization on the trained Gaussian process model with varying levels of accuracy/training points. To offset differences in computational cost between the random search and the optimization on the Gaussian process model, the results of 400 iterations of random search were compared against the results of 200 iterations of optimization. This graph shows the combined results of four sets of trials. Part (b) shows the starting and ending points of several optimization runs on a PCA plot of latent space colored by the objective function. Highlighted in black is the path illustrated in (c). Part (c) shows a spherical interpolation between the actual start and finish molecules using a constant step size. The QED, SAS, and percentile score are reported for each molecule.
  • Figure 5: Distribution and statistics of (a) the mean of latent space coordinates, (b) the standard deviation of latent space coordinates, and (c) the norm of latent space coordinates of the encoded representation of randomly selected molecules from the ZINC validation set. (d) Distribution of Euclidean distances between random pairs of validation molecules in the ZINC VAE.
  • ...and 3 more figures
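
The captions for Figures 2(d) and 4(c) refer to spherical (slerp) interpolation between two points in latent space. The sketch below is a generic slerp written in plain Python. It is not the authors' code, and the helper names `slerp` and `slerp_path` are hypothetical. It assumes nonzero input vectors and falls back to linear interpolation when they are nearly parallel or antiparallel, where the slerp weights are numerically undefined.

```python
import math


def slerp(z0, z1, t):
    """Spherical linear interpolation between z0 and z1 at fraction t in [0, 1]."""
    dot = sum(a * b for a, b in zip(z0, z1))
    norm0 = math.sqrt(sum(a * a for a in z0))
    norm1 = math.sqrt(sum(b * b for b in z1))
    # Angle between the vectors; clamp the cosine to guard against rounding.
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < 1e-8:
        # Degenerate angle (about 0 or pi): use plain linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]


def slerp_path(z0, z1, steps=6):
    """Return `steps` points (steps >= 2), endpoints included, evenly spaced in t."""
    return [slerp(z0, z1, i / (steps - 1)) for i in range(steps)]
```

For example, `slerp_path(z_start, z_end, steps=6)` returns six latent vectors, matching the 6-step setting in the Figure 2(d) caption. Each vector would then be passed to a decoder to obtain a candidate SMILES string. For inputs of equal norm, every intermediate point keeps that norm, which is the usual motivation for preferring slerp over linear interpolation in a Gaussian latent space.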