Predicting Spin-Crossover Behavior in Metal-Organic Frameworks from Limited and Noisy Data Using Quantile Active Learning

Ashna Jose, Emilie Devijver, Martin Uhrin, Noel Jakse, Roberta Poloni

TL;DR

This work shows that spin crossover can be reliably identified from limited and imperfect data through smart training-set selection, enabling accelerated screening of SCO MOFs.

Abstract

Spin-crossover (SCO) metal-organic frameworks (MOFs) hold great promise for sensing, spintronics, and gas-related applications; however, only a small number of SCO-active examples are known among the thousands of MOFs already synthesized. Computational screening enhanced by machine learning offers a powerful route to uncover these hidden candidates much more rapidly than trial-and-error experiments. Progress is nevertheless limited by the cost of obtaining accurate adiabatic energy differences, which typically require separate geometry optimizations for both spin states, a process that is technically challenging, prone to convergence failures, and difficult to automate at scale. To mitigate these issues, we introduce a data-efficient strategy based on Quantile Regression Tree-based Active Learning (QRT-AL), designed to navigate large chemical spaces while remaining robust to the noisy and scarce labels obtained from unrelaxed geometries. After a representative subset of 200 MOFs is actively selected for electronic-structure evaluation, a Random Forest regressor trained on these data accurately identifies SCO-relevant candidates despite label noise, recovering 82% of true positives with only two false negatives. Applying the model to the unlabeled dataset yields a new collection of high-confidence SCO MOFs, which we denote pSCO-105. This work shows that spin crossover can be reliably identified from limited and imperfect data through smart training-set selection, enabling accelerated screening of SCO MOFs.
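
The pool-based labeling loop described above (and summarized in Figure 1) can be sketched in a few lines. This is a minimal illustration of the control flow only, not the authors' QRT-AL implementation: `label_fn` stands in for the DFT calculation, `select_fn` stands in for the quantile-regression-tree selection rule, and the names `active_learning_loop`, `n_init`, and `batch_size`, as well as the values 20 and 20 used below, are assumptions introduced here. Only the pool size (2184) and the labeling budget (200) come from the text.

```python
import random
from typing import Callable, Dict, List, Sequence


def active_learning_loop(
    pool: Sequence[str],
    label_fn: Callable[[str], float],
    select_fn: Callable[[Dict[str, float], List[str], int], List[str]],
    n_init: int,
    batch_size: int,
    budget: int,
    seed: int = 0,
) -> Dict[str, float]:
    """Label up to `budget` items: a random start, then batches from `select_fn`."""
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)

    # Random initialization of the training set.
    n_init = min(n_init, budget, len(unlabeled))
    labeled = {name: label_fn(name) for name in unlabeled[:n_init]}
    unlabeled = unlabeled[n_init:]

    # Iterate until the labeling budget is spent or the pool is exhausted.
    while len(labeled) < budget and unlabeled:
        k = min(batch_size, budget - len(labeled), len(unlabeled))
        candidates = set(unlabeled)
        # Keep only valid, still-unlabeled picks, at most k of them.
        batch = [n for n in select_fn(labeled, unlabeled, k) if n in candidates][:k]
        if not batch:
            break  # the selector returned nothing usable
        for name in batch:
            labeled[name] = label_fn(name)
        chosen = set(batch)
        unlabeled = [n for n in unlabeled if n not in chosen]
    return labeled


# Hypothetical stand-ins, for demonstration only.
def toy_label(name: str) -> float:
    return float(len(name))  # placeholder for an electronic-structure label


def first_k(labeled: Dict[str, float], unlabeled: List[str], k: int) -> List[str]:
    return unlabeled[:k]  # placeholder for a model-driven selection rule


pool = [f"mof_{i}" for i in range(2184)]
training = active_learning_loop(
    pool, toy_label, first_k, n_init=20, batch_size=20, budget=200
)
```

In a real workflow, `select_fn` would refit a model on `labeled` at each iteration and rank the unlabeled pool accordingly; the loop structure itself stays the same.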

Paper Structure

This paper contains 21 sections, 4 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: Schematic representation of the workflow developed in this work. The QMOF database is pre-screened for potential SCO-active MOFs to obtain the MOF-2184 subset. The test set is selected using a clustering-based algorithm. QRT-AL is then initialized with a randomly selected subset of MOFs, for which $\Delta E_\text{H--L}$ values are computed using DFT (AiiDA). Labeled MOFs are added to the training set, and the active learning loop is iterated until 200 MOFs are labeled. The resulting set of 276 $\Delta E_\text{H--L}$ values, computed using DFT without geometry optimization, is referred to here as the cSCO-276 dataset. This set is used to train a machine learning model that predicts $\Delta E_\text{H--L}$, which is in turn used to identify high-confidence SCO-active MOFs (pSCO-105).
  • Figure 2: (a) Overview of the pre-screening steps applied to the QMOF database to obtain the MOF-2184 subset, which contains potential SCO candidates. (b) Pie chart showing the distribution of MOFs by transition metal. (c) Periodic table showing the number of MOFs containing a given element in the QMOF database; the elements present in the MOF-2184 subset are highlighted in red.
  • Figure 3: UMAP visualization of the MOF-2184 dataset in the ST-37 descriptor space. Five prominent clusters can be observed: four major clusters correspond to MOFs containing the transition metals Fe (yellow), Co (green), Ni (blue), and Mn (red). A smaller cluster at the bottom (violet) consists of MOFs for which the range-of-atomic-number feature is $\geq$ 30. Black circles indicate MOFs selected by iRDM for the test set, showing a well-distributed selection across the different clusters.
  • Figure 4: Overview of training set construction using Quantile Regression Tree-based Active Learning (QRT-AL).
  • Figure 5: (a) $\Delta E_\text{H--L,O}$, obtained using the SCO-MOF-RelaxWorkChain, plotted against $\Delta E_\text{H--L,U}$, obtained using the SCO-MOF-SCF-WorkChain on unoptimized structures. The grey shaded region corresponds to an estimate of the values of $\Delta E_\text{H--L,O}$ that are potentially interesting for SCO, and the corresponding region for $\Delta E_\text{H--L,U}$ is shaded in blue. The diagonal indicates the $x=y$ line. (b) Distribution of the labels $\Delta E_\text{H--L,U}$ of the test set. The range of the labels is partitioned into four quantiles, Q$_1$, Q$_2$, Q$_3$ and Q$_4$, with Q$_3$ being the quantile of interest based on (a).
  • ...and 2 more figures
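
The quantile partition in the Figure 5(b) caption, together with the "true positives / false negatives" language of the abstract, can be illustrated with a short sketch. It assumes equal-count quantile bins (the caption does not say how the cut points are chosen) and treats membership in Q$_3$ as the positive class; the function names `quantile_bins` and `recall_for_bin` and all numbers below are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right
from statistics import quantiles
from typing import List, Sequence, Tuple


def quantile_bins(labels: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Return the three cut points and a bin index (1..4) for every label."""
    # Assumption: equal-count bins from the empirical distribution.
    cuts = quantiles(labels, n=4, method="inclusive")
    # A label equal to a cut point is assigned to the upper bin.
    bins = [bisect_right(cuts, y) + 1 for y in labels]
    return cuts, bins


def recall_for_bin(true_bins: Sequence[int], pred_bins: Sequence[int], target: int = 3) -> float:
    """Fraction of items truly in `target` that are also predicted in `target`."""
    tp = sum(1 for t, p in zip(true_bins, pred_bins) if t == target and p == target)
    fn = sum(1 for t, p in zip(true_bins, pred_bins) if t == target and p != target)
    return tp / (tp + fn) if (tp + fn) else float("nan")


# Toy labels (arbitrary units), for demonstration only.
toy_labels = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
cuts, bins = quantile_bins(toy_labels)

# Toy true and predicted bin assignments.
toy_true = [3, 3, 3, 1, 4]
toy_pred = [3, 3, 2, 3, 4]
toy_recall = recall_for_bin(toy_true, toy_pred, target=3)
```

Here recall is TP / (TP + FN) restricted to the bin of interest, so a candidate whose predicted label falls outside Q$_3$ while its reference label lies inside counts as a false negative.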