Generating new coordination compounds via multireference simulations, genetic algorithms and machine learning: the case of Co(II) molecular magnets

Lion Frangoulis, Zahra Khatibi, Lorenzo A. Mariano, Alessandro Lunghi

TL;DR

The presented framework is able to generate novel organic ligands and explore chemical motifs beyond those available in pre-existing structural databases.

Abstract

The design of coordination compounds with target properties often requires years of continuous feedback between theory, simulations and experiments. In the case of magnetic molecules, this conventional strategy has indeed led to the breakthrough of single-molecule magnets with working temperatures above nitrogen's boiling point, but at a significant cost in resources and time. Here, we propose a computational strategy able to accelerate the discovery of new coordination compounds with desired electronic and magnetic properties. Our approach is based on a combination of high-throughput multireference ab initio methods, genetic algorithms and machine learning. While genetic algorithms allow for an intelligent sampling of the vast chemical space available, machine learning reduces the computational cost by pre-screening molecular properties in advance of their accurate and automated multireference ab initio characterization. Importantly, the presented framework is able to generate novel organic ligands and explore chemical motifs beyond those available in pre-existing structural databases. We showcase the power of this approach by automatically generating new Co(II) mononuclear coordination compounds with record magnetic properties in a fraction of the time required by either experiments or brute-force ab initio approaches.

Paper Structure

This paper contains 4 sections, 1 equation, 7 figures.

Figures (7)

  • Figure 1: Schematic representation of the genetic algorithm elements. (a) A prepared list of ligands is used as the basis for the algorithm and indexed (top left). The genome consists of the indexes of the individual ligands involved, with each index being one gene indicated by a blue box (top center). The final compound is then created based on the 3D coordinates of the ligands, as seen on the right with the metal core and its first coordination shell neighbors highlighted. (b) The database for the dynamic encoding consists of individual operators, each responsible for either adding an atom of a specific element and charge or creating rings or branches within the molecule (left). The genome is a list of these operators making up SELFIES, separated into parts for the individual ligands, and finally assembled into its 3D representation, shown on the right. (c) On the left, two parent genomes are shown, with their corresponding genes colored depending on the parent. Crossover for both encodings consists of splitting the genomes of two parents and cross-assembling the resulting fragments, as shown in the middle. Mutation (marked in red) consists of the replacement of a single ligand with another from the database for the static encoding, and the replacement or removal/addition of a single operator from the database for the dynamic encoding.
  • Figure 2: Genetic algorithm performance. (a) Comparison of minimum $D$ growth for the random sampling approach and the GA with different population sizes, performed on the COMPASS test set and averaged over 1000 runs. The termination criterion is finding one of the top three compounds in the set. The inset shows the speedup and the average number of generations needed to terminate a run for different population sizes. (b) Comparison of the compound evaluations needed for random sampling and the GA with different population sizes, versus the termination criterion: finding one of the top 1, 3, 5, 10 or 20 compounds within the COMPASS dataset. The inset shows the speedup that the GA with different population sizes achieves over random sampling.
  • Figure 3: Results for the GA with static encoding. (a) Minimum and average anisotropy growth of the static run. The minimum curves of both runs have a very jagged growth, as expected, while the average continuously goes down. However, both reach the < -200 cm$^{-1}$ regime. (b) Temporal evolution of GA-produced compounds and their anisotropy. The GA population at each generation roughly follows a skewed normal distribution, with later generations concentrating and converging towards higher $D$ values through a steady progression. (c) Cumulative distribution of compounds comparing the GA and random sampling performance, with the GA showing a higher relative population in the medium negative regime. (d) Top-performing compounds. None of the top compounds preserved the original tetrahedral geometry; however, all show reasonable optimized structures.
  • Figure 4: Schematic depiction of an individual decision tree in a random forest. The individual elements of a molecular fingerprint (FP) are used at each decision node to gain information about the distinction between Class 1 and Class 0. This process creates multiple branches, forming a hierarchical structure of nodes that progressively increase class purity at each split, eventually reaching the terminal nodes (leaves). The red path highlights the decision route followed by a sample molecular fingerprint as it traverses through the tree to reach a leaf node associated with Class 1.
  • Figure 5: Random forest pre-screening results. (a) Number of molecules in Class 1 (i.e., molecules with magnetic anisotropy below the specified threshold) plotted against the threshold values investigated in the dataset. The percentages next to the markers within the plot indicate the proportion of Class 1 molecules relative to the total number of molecules. (b) Receiver Operating Characteristic (ROC) curve for the RFC model. The curve illustrates the model's performance compared to a random classifier. (c) Learning curve: accuracy of the RFC model versus the training set size, averaged over 100 calculations. The horizontal axis represents the proportion of the training set relative to the total COMPASS dataset. The shaded area represents the variability in accuracy across the 100 calculations.
  • ...and 2 more figures
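
The static-encoding operators described in the Figure 1 caption (a genome of ligand indexes, crossover by splitting and cross-assembling two parents, mutation by swapping one ligand for another from the database) could be sketched roughly as follows. This is only an illustration, not the authors' implementation: the ligand labels, the genome length of four, the single cut point and all function names are assumptions made for the example.

```python
import random

# Hypothetical ligand database; the labels are placeholders, not from the paper.
LIGANDS = ["L0", "L1", "L2", "L3", "L4", "L5", "L6", "L7"]
GENOME_LENGTH = 4  # assumed: one gene (ligand index) per coordinated ligand


def random_genome(rng):
    """Draw a genome as a list of ligand indexes into LIGANDS."""
    return [rng.randrange(len(LIGANDS)) for _ in range(GENOME_LENGTH)]


def crossover(parent_a, parent_b, rng):
    """Split both parents at one cut point and cross-assemble the fragments."""
    cut = rng.randrange(1, len(parent_a))  # never cut at the ends
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


def mutate(genome, rng):
    """Replace a single gene with a different ligand index from the database."""
    child = list(genome)
    position = rng.randrange(len(child))
    choices = [i for i in range(len(LIGANDS)) if i != child[position]]
    child[position] = rng.choice(choices)
    return child


rng = random.Random(0)  # fixed seed so the sketch is reproducible
parent_a, parent_b = random_genome(rng), random_genome(rng)
child_a, child_b = crossover(parent_a, parent_b, rng)
mutant = mutate(child_a, rng)
print(parent_a, parent_b, child_a, child_b, mutant)
```

In a full run, each genome would still have to be assembled into a 3D structure and scored by its anisotropy $D$; those steps are outside the scope of this sketch.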