Efficient training of generative models from multireference simulations and its application to the design of Dy complexes with large magnetic anisotropy

Zahra Khatibi, Lorenzo A. Mariano, Lion Frangoulis, Alessandro Lunghi

Abstract

Generative machine learning models can potentially provide direct access to novel and relevant portions of the full chemical space, overcoming the cost of systematic sampling. However, the training of these models generally requires a large amount of data, often precluding the use of expensive high-level ab initio simulations for this task. The generation of coordination compounds of Dy with large magnetic anisotropy represents a topical example, where multireference simulations of large molecules are necessary to perform reliable predictions. Here, we show that a semi-supervised, chemically inspired training-by-proxy of generative variational autoencoders can reduce the cost associated with building a training set from multireference simulations by two orders of magnitude. We illustrate the power of this approach by generating hundreds of new organic ligands for Dy(III) pentagonal bipyramidal complexes exhibiting record values of magnetic anisotropy, while starting from datasets as small as 1k multireference calculations. This work thus paves the way to the computational generation of molecules as complex as coordination compounds with target electronic and magnetic properties.

Paper Structure (7 sections, 10 figures)


Figures (10)

  • Figure 1: Schematic representation of a pentagonal bipyramidal Dy(III) SMM used in this work. Left: The Dy ion (light blue) is coordinated by five water molecules (red = oxygen, white = hydrogen) and two axial ligands L. Right: Selected examples for L (grey = carbon, blue = nitrogen).
  • Figure 2: VAE trained on ligands and its sampling performance. (a) Schematic representation of a VAE architecture. Input SMILES strings are converted into one-hot encodings and fed into an encoder that maps the training data into a continuous latent distribution. A reparameterization method (Kingma and Welling, 2013) is then employed to sample the latent space, and the sampled latent vector is passed to the decoder to reconstruct the input data. After training, a random seed from the latent space can be used with LP sampling to generate multiple samples, many of which correspond to novel molecules. (b) Learning curve of the VAE model, showing reconstruction accuracy as a function of training set size. (c) Sampling rates obtained using a single random seed. For each standard deviation (std) value, 50 batches of 100 samples were generated; samples were labeled as novel if they were not present in the training set, and unique if they were not repeated within the generated samples at each step. The accumulated novel and unique samples over time are presented as the global variables. Based on the learning curve shown in (b), we selected a training set of 114k samples for the sampling experiments.
  • Figure 3: VAE trained on ligands and their structural features. (a) Architecture of a VAE model with guided sampling using structural information. Structural features of the associated SMM after DFT optimization, together with the corresponding latent vector, are fed into a DNN to predict the Kramers doublet (KD) energy gaps. (b) Learning curve of the model shown in (a), illustrating the convergence of the DNN R$^2$ score as the training set size increases. The R$^2$ score is defined as one minus the ratio between the squared distance of the predictions from the target values of all KD energy gaps and the squared distance of the target values from their mean. This results in an R$^2$ vector of length 7, which is averaged for presentation in (b). (c) Parity plot of all KD energy gaps for a model trained on a dataset in which 11k samples are labeled.
  • Figure 4: Latent space ordered by structural features. (a, b) The span of the latent space along two randomly selected dimensions of the 32-dimensional latent vector. The scatter plots are colored according to the predicted first and last Kramers energy gaps, obtained from a DNN that takes as input the latent vector concatenated with structural features of the fully relaxed SMM, such as the SDP and MPP descriptors. (c, d) PCA of the latent space in 2D. The model was trained on a dataset of 129k molecules, of which only 18k ligands were labeled with Kramers energy gaps. The plots shown here correspond to the full set of 23k labeled molecules, including the 18k used during training.
  • Figure 5: Kramers energy gaps of random seeds and their samples. (a) First and (b) last KD energy gaps of the generated samples versus those of the corresponding seeds. The target KD gaps from the 23k dataset were divided into six bins, from which six batches of random seeds were selected. Each batch contained 10 unique samples that were generated using a scaled std of up to 0.2. The sampling procedure was repeated five times, yielding a total of 300 generated samples. The generated samples were subsequently assembled into Dy(III) SMMs, structurally optimized, and their Kramers gap energies were evaluated using CASSCF calculations.
  • ...and 5 more figures
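
The encoding, latent sampling, and novelty bookkeeping described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names, the space-padded alphabet, and the exact way "novel", "unique", and the accumulated "global" count are computed are all assumptions made here for clarity. The encoder and decoder networks, and the LP sampling step, are omitted.

```python
import numpy as np


def one_hot_encode(smiles, alphabet, max_len):
    """Encode a SMILES string as a (max_len, len(alphabet)) one-hot matrix.

    Assumption: the alphabet contains a space, which is used as padding.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    padded = smiles.ljust(max_len)
    encoded = np.zeros((max_len, len(alphabet)), dtype=np.float32)
    for position, ch in enumerate(padded[:max_len]):
        encoded[position, index[ch]] = 1.0
    return encoded


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def perturb_seed(seed_z, std, n_samples, rng):
    """Generate latent vectors around one seed by adding Gaussian noise."""
    noise = rng.standard_normal((n_samples, seed_z.size))
    return seed_z[None, :] + std * noise


def sampling_rates(batches, training_set):
    """Per-batch novel and unique rates plus a running global count.

    Hypothetical reading of the caption: a sample is novel if it is absent
    from the training set; the unique rate is the fraction of distinct
    strings in the batch; the global count accumulates distinct novel
    strings over all batches seen so far.
    """
    seen_novel = set()
    rates = []
    for batch in batches:
        distinct = set(batch)
        novel = [s for s in batch if s not in training_set]
        seen_novel.update(novel)
        rates.append({
            "unique": len(distinct) / len(batch),
            "novel": len(novel) / len(batch),
            "global_novel_unique": len(seen_novel),
        })
    return rates
```

In the setting of the caption, `perturb_seed` would be called 50 times with `n_samples=100` for each std value, each batch of latent vectors would be decoded to SMILES strings by the trained decoder, and the decoded batches would be passed to `sampling_rates` together with the set of training SMILES.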
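
The per-gap R$^2$ score mentioned in the Figure 3 caption can be illustrated as follows. The sketch assumes the conventional coefficient of determination, one minus the ratio of the residual sum of squares to the total sum of squares, computed separately for each of the 7 KD energy gaps and then averaged. The function names and the array layout (one row per sample, one column per gap) are assumptions, and the authors' exact formula may differ.

```python
import numpy as np


def r2_per_gap(y_true, y_pred):
    """Return one R^2 value per column (one column per KD energy gap).

    Assumes arrays of shape (n_samples, n_gaps) and a non-constant target
    in every column, so the total sum of squares is never zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Squared distance of the predictions from the targets.
    ss_res = np.sum((y_true - y_pred) ** 2, axis=0)
    # Squared distance of the targets from their per-gap mean.
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot


def mean_r2(y_true, y_pred):
    """Average the per-gap R^2 vector into a single score."""
    return float(np.mean(r2_per_gap(y_true, y_pred)))
```

With this convention a perfect model scores 1 for every gap, while a model that always predicts the mean of each gap scores 0, which makes the averaged value a convenient single number to track along a learning curve.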