Scaling Law of Sim2Real Transfer Learning in Expanding Computational Materials Databases for Real-World Predictions

Shunya Minami, Yoshihiro Hayashi, Stephen Wu, Kenji Fukumizu, Hiroki Sugisawa, Masashi Ishii, Isao Kuwajima, Kazuya Shiratori, Ryo Yoshida

TL;DR

This study demonstrates a scaling law for simulation-to-real (Sim2Real) transfer learning across several machine learning tasks in materials science: the prediction error on real systems decreases as a power law in the size of the computational dataset.

Abstract

To address the challenge of limited experimental materials data, extensive physical property databases are being developed from high-throughput computational experiments, such as molecular dynamics simulations. Previous studies have shown that fine-tuning a predictor pretrained on a computational database to a real system can yield models that generalize better than models trained from scratch. This study demonstrates a scaling law for simulation-to-real (Sim2Real) transfer learning across several machine learning tasks in materials science. Case studies of three prediction tasks for polymers and inorganic materials reveal that the prediction error on real systems decreases as a power law in the size of the computational dataset. Observing this scaling behavior offers several insights for database development: it helps determine the sample size needed to reach a desired performance, identify equivalent sample sizes for physical and computational experiments, and guide the design of data production protocols for downstream real-world tasks.
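As a rough illustration of how a fitted scaling curve could be used to estimate the sample size needed for a target error, the sketch below fits a pure power law, error = D * n^(-alpha), by least squares in log-log space and then inverts it. The functional form without a constant floor term, the function names `fit_power_law` and `required_size`, and all numbers are assumptions made for illustration; they are not taken from the paper, and a real fit may need an additional offset for the error that simulation data cannot remove.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(D) - alpha * log(n).

    Returns (D, alpha) for the assumed model error = D * n**(-alpha).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    alpha = -slope  # decay rate of the error
    D = math.exp(y_mean - slope * x_mean)  # prefactor
    return D, alpha


def required_size(D, alpha, target_error):
    """Invert error = D * n**(-alpha) to get the dataset size n."""
    return (D / target_error) ** (1.0 / alpha)


# Synthetic, noise-free values chosen for illustration (not from the paper).
sizes = [100, 200, 400, 800, 1600]
true_D, true_alpha = 2.0, 0.5
errors = [true_D * n ** (-true_alpha) for n in sizes]

D, alpha = fit_power_law(sizes, errors)
n_needed = required_size(D, alpha, target_error=0.02)
print(f"D = {D:.3f}, alpha = {alpha:.3f}, n for error 0.02 = {n_needed:.0f}")
```

With noisy measurements the recovered parameters would only approximate the underlying ones, and extrapolating far beyond the observed range of dataset sizes should be treated with caution.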

Paper Structure

This paper contains 20 sections, 8 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Transfer learning of polymer property predictions using all-atom classical MD simulations. (a) Neural network architecture. (b) Scaling behavior of Sim2Real transfer for four properties: refractive index, density, specific heat capacity ($C_{\mathrm P}$), and thermal conductivity. The horizontal axis represents the simulation data size, and the vertical axis shows the MAE averaged over 100 independent trials, with a 90% confidence interval calculated by bootstrap sampling. The dashed line is the fitted power law, with its equation given at the bottom left, and the horizontal red line indicates the mean MAE for direct learning with no pretraining.
  • Figure 2: Multidimensional scaling of Sim2Real transfer learning, illustrated by the density prediction of amorphous polymers. (a) Scaling with increasing amounts of simulation data for various experimental dataset sizes, and (b) scaling with increasing amounts of experimental data for different simulation dataset sizes. Each line represents the MAE averaged over 500 independent trials.
  • Figure 3: Model architecture of Sim2Real multitask learning used for predicting the Flory--Huggins $\chi$ parameter.
  • Figure 4: Scaling law observed in the Flory--Huggins $\chi$ parameter prediction task. (a) Scaling behavior with increasing simulation dataset size. The horizontal axis represents the number of polymer--solvent pairs used as the simulation dataset, and the vertical axis shows the average MAE of 100 independent trials, with a 90% confidence interval calculated via bootstrapping. The dashed line is the fitted power law, with its equation given at the bottom left, and the horizontal red line indicates the average MAE for direct learning without pretraining. (b) Scaling behavior for different experimental dataset sizes, and (c) scaling with increasing experimental dataset size for different simulation dataset sizes. Each line shows the average MAE over 100 trials.
  • Figure 5: (a) Observation of Sim2Real scaling behaviors for different polymer classes in the $\chi$ parameter prediction task. Test instances of polymer--solvent pairs were classified into 11 classes based on structural features. The $m$ value is denoted in the upper-right corner of each panel. (b) Predictive capability of COSMO-RS simulations (horizontal axis) against experimental values (vertical axis) for each of the 11 polymer classes in the $\chi$ parameter predictions.
  • ...and 1 more figures
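
The captions above report MAE values averaged over independent trials with a 90% bootstrap confidence interval. A minimal sketch of one common way to compute such an interval, the percentile bootstrap of the mean, is given below. The specific bootstrap variant, the function name `bootstrap_ci`, and the synthetic per-trial MAE values are assumptions made for illustration; the captions do not state which procedure the authors used.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, level=0.90, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lo = means[int(tail * n_resamples)]
    hi = means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lo, hi


# Synthetic per-trial MAE values (illustrative only, not from the paper).
rng = random.Random(42)
trial_maes = [0.05 + 0.01 * rng.random() for _ in range(100)]

mean_mae = statistics.fmean(trial_maes)
ci_low, ci_high = bootstrap_ci(trial_maes)
print(f"mean MAE = {mean_mae:.4f}, 90% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

The interval here describes uncertainty in the mean over trials at a single dataset size; repeating it at each simulation data size would give the kind of error band the captions describe.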