Transfer learning from first-principles calculations to experiments with chemistry-informed domain transformation

Yuta Yahagi, Kiichi Obuchi, Fumihiko Kosaka, Kota Matsui

TL;DR

This work tackles the bottleneck of scarce experimental data in materials science by introducing a chemistry-informed domain transformation that bridges first-principles simulations and experiments. The method first maps computational data into the experimental domain using ensemble averaging and a physics-informed conversion function, then applies standard homogeneous domain adaptation to build predictive models with high data efficiency. A case study on catalyst activity for the reverse water-gas shift (RWGS) reaction demonstrates positive transfer: combining pretraining on abundant DFT data with a small amount of experimental data yields far lower test errors than training from scratch, sometimes by an order of magnitude, while using fewer target data. The approach offers a practical route to accelerating catalyst discovery by integrating theory, computation, and data, potentially reducing the number of laboratory experiments required.

Abstract

Simulation-to-Real (Sim2Real) transfer learning, a machine learning technique that efficiently solves a real-world task by leveraging knowledge from computational data, has received increasing attention in materials science as a promising solution to the scarcity of experimental data. We propose an efficient transfer learning scheme from first-principles calculations to experiments based on a chemistry-informed domain transformation, which integrates the heterogeneous source and target domains by harnessing the underlying physics and chemistry. The proposed method maps the computational data from the simulation space (source domain) into the space of experimental data (target domain). During this process, these qualitatively different domains are integrated using two pieces of prior chemical knowledge: (1) the statistical ensemble, and (2) the relationship between source and target quantities. As a proof of concept, we predict the catalyst activity for the reverse water-gas shift reaction by using abundant first-principles data in addition to the experimental data. Through this demonstration, we confirm that the transfer learning model exhibits positive transfer in both accuracy and data efficiency. In particular, high accuracy was achieved with only a few (fewer than ten) target data in the domain transformation, with a prediction error one order of magnitude smaller than that of a model trained from scratch on over 100 target data. This result indicates that the proposed method achieves high prediction performance with few target data, which helps reduce the number of trials needed in real laboratories.
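
To make the two ingredients of the domain transformation concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the statistical ensemble is a canonical (Boltzmann-weighted) average over the computed energies of each catalyst, and that the source-to-target relationship is a linear, Brønsted-Evans-Polanyi-like map from adsorption energy to activation energy fitted on a handful of experimental points. The function names, catalyst labels, and all numbers are hypothetical.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def boltzmann_average(energies_ev, temperature_k):
    """Canonical-ensemble average of per-configuration energies (eV).

    Assumption: the ensemble is Boltzmann-weighted; the paper may use a
    different statistical treatment.
    """
    beta = 1.0 / (K_B_EV * temperature_k)
    e_min = min(energies_ev)  # shift energies for numerical stability
    weights = [math.exp(-beta * (e - e_min)) for e in energies_ev]
    z = sum(weights)
    return sum(w * e for w, e in zip(weights, energies_ev)) / z


def fit_linear_conversion(x, y):
    """Ordinary least-squares fit of y = a * x + b.

    Assumption: a linear (BEP-like) relation stands in for the
    source-to-target conversion function.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def transform_source(source_energies, temperature_k, a, b):
    """Map each catalyst's computed energies to a target-domain estimate."""
    return {
        name: a * boltzmann_average(energies, temperature_k) + b
        for name, energies in source_energies.items()
    }


if __name__ == "__main__":
    # Hypothetical adsorption energies (eV) for several surface configurations.
    source = {
        "cat_A": [-1.20, -1.05, -0.90],
        "cat_B": [-0.60, -0.55],
        "cat_C": [-0.30, -0.10, 0.05],
    }
    temperature = 600.0  # K, hypothetical reaction temperature

    # A few hypothetical experimental activation energies (eV) used to
    # calibrate the conversion function.
    measured = {"cat_A": 0.45, "cat_B": 0.80}
    x = [boltzmann_average(source[k], temperature) for k in measured]
    y = [measured[k] for k in measured]
    a, b = fit_linear_conversion(x, y)

    for name, value in transform_source(source, temperature, a, b).items():
        print(f"{name}: {value:.3f} eV")
```

Under these assumptions, the transformed values live in the same space as the experimental quantity, so they could serve as pretraining data for an ordinary homogeneous domain adaptation step, with the few experimental points used both to calibrate the conversion and to adapt the final model.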

Paper Structure

This paper contains 37 sections, 13 equations, 19 figures, 8 tables, 3 algorithms.

Figures (19)

  • Figure 1: Schematics of our Sim2Real transfer learning framework for materials. Here, as an example, the adsorption energy $E^{\mathrm{ads}}$ and activation energy $E^{\mathrm{act}}$ are assigned as computational and experimental quantities, respectively. See main text for further explanation.
  • Figure 2: Schematics of the chemistry-informed domain transformation. Rectangles represent processes, rounded rectangles represent data and their domains, and the red double-rounded rectangle represents the chemical information (CI).
  • Figure 3: Histograms of source data (OC20) with (a) C2O1 and (b) H3N1
  • Figure 4: Histograms of target data with (a) real experimental data (Wang2023) and (b) the dummy data.
  • Figure 5: Occurrence frequency during iteration in the multiple imputation.
  • ...and 14 more figures