Data-Efficient Machine learning for Predicting Dopant Formation Energies in TiO$_2$ Monolayer

Kati Asikainen, Matti Alatalo, Marko Huttula, Assa Aravindh Sasikala Devi

TL;DR

This work addresses data scarcity in predicting dopant formation energies for a 2D TiO$_2$ monolayer by combining DFT with a compact, descriptor-based ML framework. The approach builds a small, physics-informed dataset of Pt-doped configurations and tests chemical transferability to Ag-doped systems, achieving high predictive accuracy for Pt ($R^2$ up to ~0.9) and enabling transfer with limited Ag data. Key insights include the dominance of the CN_4Å_mean descriptor and the robustness of Pt predictions when additional, targeted dopant data are incorporated. Overall, the framework enables data-efficient, transferable dopant screening in TiO$_2$ monolayers, guiding design while minimizing computational cost.
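To make the CN_4Å_mean descriptor more concrete, the sketch below assumes that the name denotes the mean coordination number of the dopant sites within a 4 Å cutoff. That reading, the function name `mean_coordination_number`, the toy coordinates, and the omission of periodic boundary conditions are all illustrative assumptions, not the authors' implementation.

```python
import math


def mean_coordination_number(dopants, atoms, cutoff=4.0):
    """Average number of atoms within `cutoff` (in Å) of each dopant site.

    Assumption: a plain, non-periodic neighbour count. A real monolayer
    supercell would also need periodic images of the atoms.
    """
    if not dopants:
        raise ValueError("need at least one dopant position")
    counts = []
    for d in dopants:
        n = 0
        for a in atoms:
            dist = math.dist(d, a)
            # Skip the dopant itself (zero distance) and atoms beyond the cutoff.
            if 1e-8 < dist <= cutoff:
                n += 1
        counts.append(n)
    return sum(counts) / len(counts)


# Toy Cartesian coordinates in Å (hypothetical, not taken from the paper).
dopants = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
atoms = dopants + [(2.0, 0.0, 0.0), (0.0, 3.9, 0.0), (8.0, 0.0, 0.0)]

cn_mean = mean_coordination_number(dopants, atoms)
print(cn_mean)  # 2.5 for these toy coordinates
```

In this toy example the first dopant has three neighbours inside the cutoff and the second has two, so the mean is 2.5. A descriptor of this kind is a single number per configuration, which is what makes it usable with very small training sets.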

Abstract

Machine learning models are increasingly applied in materials science, yet their predictive power is often constrained by data scarcity. Here, we show that accurate predictions can be achieved, even with a limited number of training examples, provided the dataset is compact and grounded in physically relevant quantities. By combining density functional theory calculations with a machine-learning framework, we construct accurate descriptor-based models to predict the formation energies of doped lepidocrocite TiO$_2$ monolayers. The predictive accuracy of the machine-learning models is first evaluated for single-dopant Pt configurations, demonstrating that the selected structural and chemical descriptors reliably capture the key factors governing dopant stability. Chemical transferability is then examined by extending the dataset to include Ag-doped configurations. Predictive accuracy improves systematically as additional Ag-doped data points are included in the training set, while the performance for Pt remains robust. These results highlight the potential of small, well-curated datasets combined with physically informed descriptors to enable not only accurate but also chemically transferable machine-learning-driven screening of doped TiO$_2$ monolayers.

Paper Structure (12 sections, 2 equations, 5 figures)

Figures (5)

  • Figure 1: (a) Schematic representation of substitutional doping at the $O_b$ site in a TiO$_2$ monolayer. (b) Per-atom formation energy of Pt-doped configurations plotted against the number of Pt dopant atoms. The gray dashed line shows the least-squares fit, indicating an inverse relationship between the variables. (c) Mean absolute SHAP values of the most important features (subset shown). The shaded coral region highlights the four features selected for ML predictions by recursive feature elimination on the Pt-doped dataset, shown in (d).
  • Figure 2: (a) Test $R^2$ for the Pt-doped dataset and (b) corresponding RMSE and MAE values (meV per Pt atom) before and after data addition.
  • Figure 3: DFT-calculated versus ML-predicted formation energy per Pt atom (in eV) for the training and test sets using (a,d) SVR, (b,e) GPR, and (c,f) LR. The left-column panels (a-c) show the results using the original training set, and the right-column panels (d-f) show the results after expanding the training set. The black dashed line corresponds to perfect prediction.
  • Figure 4: (a) Mean absolute SHAP values of the most important features (subset shown) after inclusion of Ag-doped data. The shaded coral region highlights the seven features selected for ML predictions by recursive feature elimination on the Pt-doped dataset, shown in (b). (c,e) Per-element RMSE and MAE of the Ag-doped dataset and (d,f) the corresponding metrics of the Pt-doped dataset versus the number of Ag-doped data points used for training.
  • Figure 5: DFT-calculated versus ML-predicted formation energy per Pt atom (in eV) for the training and test sets using (a,d) SVR, (b,e) GPR, and (c,f) LR. The left-column panels (a-c) show the results after adding 3 Ag-doped data points, and the right-column panels (d-f) show the results after adding 9 Ag-doped data points to the training set. The black dashed line corresponds to perfect prediction. Pt- and Ag-doped data points are highlighted by purple and yellow dashed squares, respectively, in the top-left plot.
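
The figure captions report test $R^2$, RMSE, and MAE, both overall and per dopant element. As a minimal sketch of how such metrics could be computed from DFT reference values and ML predictions, the block below uses the standard textbook definitions. The helper names `regression_metrics` and `per_element_metrics` are hypothetical, and the numbers are made-up placeholders, not results from the paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, and MAE for paired reference and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        # R^2 is undefined when all reference values are identical.
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


def per_element_metrics(elements, y_true, y_pred):
    """Group the samples by dopant label and compute the metrics per group."""
    groups = {}
    for el, t, p in zip(elements, y_true, y_pred):
        trues, preds = groups.setdefault(el, ([], []))
        trues.append(t)
        preds.append(p)
    return {el: regression_metrics(t, p) for el, (t, p) in groups.items()}


# Placeholder formation energies in eV per dopant atom (illustrative only).
elements = ["Pt", "Pt", "Ag", "Ag"]
e_dft = [1.0, 2.0, 3.0, 4.0]
e_ml = [1.1, 1.9, 3.2, 3.8]

overall = regression_metrics(e_dft, e_ml)
by_element = per_element_metrics(elements, e_dft, e_ml)
print(overall)
print(by_element)
```

Splitting the errors by element, as in Figure 4, makes it possible to see whether adding Ag-doped training points lowers the Ag error without degrading the Pt error. Multiplying RMSE and MAE by 1000 converts them from eV to the meV units used in Figure 2.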