Augmenting Molecular Graphs with Geometries via Machine Learning Interatomic Potentials

Cong Fu, Yuchao Lin, Zachary Krueger, Haiyang Yu, Maho Nakata, Jianwen Xie, Emine Kucukbenli, Xiaofeng Qian, Shuiwang Ji

TL;DR

This work curates a large-scale molecular relaxation dataset and demonstrates that MLIP foundation models trained on relaxation data can provide valuable molecular geometries that benefit property predictions.

Abstract

Accurate molecular property predictions require 3D geometries, which are typically obtained using expensive methods such as density functional theory (DFT). Here, we attempt to obtain molecular geometries by relying solely on machine learning interatomic potential (MLIP) models. To this end, we first curate a large-scale molecular relaxation dataset comprising 3.5 million molecules and 300 million snapshots. We then pre-train MLIP models with supervised learning to predict energies and forces from 3D molecular structures. We show that, once trained, these models can be used in two ways to obtain geometries, either explicitly or implicitly. First, they can be used to obtain approximate low-energy 3D geometries via geometry optimization. Although these geometries do not consistently reach DFT-level chemical accuracy or convergence, they can still improve downstream performance compared to non-relaxed structures. To mitigate potential biases and enhance downstream predictions, we introduce geometry fine-tuning based on the relaxed 3D geometries. Second, the pre-trained models can be directly fine-tuned for property prediction when ground-truth 3D geometries are available. Our results demonstrate that MLIP models pre-trained on relaxation data can learn transferable molecular representations that improve downstream molecular property prediction, and that they can provide approximate but practically valuable molecular geometries that benefit property prediction. Our code is publicly available at: https://github.com/divelab/AIRS/
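To make the geometry-optimization step concrete, the following is a minimal sketch of relaxing a structure by following predicted forces downhill in energy. It is not the paper's model or optimizer: `energy_and_forces` is a hypothetical toy harmonic pair potential standing in for a trained MLIP, and `relax`, `step`, `fmax`, and `max_steps` are illustrative names and values chosen here. Plain steepest descent is used for brevity; practical workflows typically rely on quasi-Newton optimizers.

```python
import math


def energy_and_forces(positions, r0=1.0, k=1.0):
    """Toy stand-in for an MLIP (assumption, not the paper's model).

    Harmonic pair potential E = sum_{i<j} 0.5 * k * (r_ij - r0)^2.
    Returns the total energy and per-atom forces F_i = -dE/dx_i.
    """
    n = len(positions)
    energy = 0.0
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = [positions[i][a] - positions[j][a] for a in range(3)]
            r = math.sqrt(sum(c * c for c in d))
            energy += 0.5 * k * (r - r0) ** 2
            # dE/dr = k * (r - r0); the chain rule gives dE/dx_i = dE/dr * d / r
            g = k * (r - r0) / r
            for a in range(3):
                forces[i][a] -= g * d[a]
                forces[j][a] += g * d[a]
    return energy, forces


def relax(positions, step=0.1, fmax=1e-4, max_steps=1000):
    """Steepest-descent relaxation: move atoms along the forces until the
    largest per-atom force norm drops below fmax (or max_steps is reached)."""
    pos = [list(p) for p in positions]
    n_steps = 0
    while n_steps < max_steps:
        _, forces = energy_and_forces(pos)
        largest = max(math.sqrt(sum(c * c for c in f)) for f in forces)
        if largest < fmax:
            break
        for p, f in zip(pos, forces):
            for a in range(3):
                p[a] += step * f[a]
        n_steps += 1
    final_energy, _ = energy_and_forces(pos)
    return pos, final_energy, n_steps


# Usage: relax a distorted three-atom structure (hypothetical coordinates).
start = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.2, 0.7, 0.0]]
initial_energy, _ = energy_and_forces(start)
relaxed, final_energy, n_steps = relax(start)
print("initial energy:", initial_energy)
print("final energy:", final_energy, "after", n_steps, "steps")
```

In the setting described above, the toy potential would be replaced by the pre-trained MLIP's energy and force predictions, and the relaxed coordinates would be passed to a downstream property predictor.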

Paper Structure

This paper contains 28 sections, 10 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: Overview of the MLIP pre-trained model training pipeline. The model is pre-trained using our curated large-scale relaxation dataset, which includes atomic numbers, forces, positions, and energies for each snapshot. The pre-trained MLIP model can either be fine-tuned for molecular property prediction when stable 3D geometries are available or employed for geometry optimization to obtain 3D geometries for downstream property prediction.
  • Figure 2: Comparison of geometry optimization based on DFT and MLIP.
  • Figure 3: Overview of geometry fine-tuning. In the pre-training stage, a property predictor is trained on stable 3D molecules for property prediction. This pre-trained predictor is then fine-tuned on 3D molecular structures relaxed by the MLIP pre-trained model, with both property prediction and geometry alignment losses.
  • Figure 4: Comparison of fine-tuning the full pre-trained model versus training only the prediction head on pre-trained or random features.
  • Figure 5: Fine-tuning performance of models pre-trained on datasets of different sizes.
  • ...and 1 more figure