Physical Consistency Bridges Heterogeneous Data in Molecular Multi-Task Learning

Yuxuan Ren, Dihan Zheng, Chang Liu, Peiran Jin, Yu Shi, Lin Huang, Jiyan He, Shengjie Luo, Tao Qin, Tie-Yan Liu

TL;DR

More accurate energy data can improve the accuracy of structure prediction, and consistency training can directly leverage force and off-equilibrium structure data to improve structure prediction, which demonstrates a broad capability for integrating heterogeneous data.

Abstract

In recent years, machine learning has demonstrated impressive capability in handling molecular science tasks. To support various molecular properties at scale, machine learning models are trained in the multi-task learning paradigm. Nevertheless, data for different molecular properties are often not aligned: some quantities, e.g. equilibrium structure, are more costly to compute than others, e.g. energy, so their data are often generated by cheaper computational methods at the cost of lower accuracy, a limitation that multi-task learning cannot directly overcome. Moreover, it is not straightforward to leverage abundant data from other tasks to benefit a particular task. To handle such data heterogeneity challenges, we exploit a special feature of molecular tasks, namely that physical laws connect them, and design consistency training approaches that allow different tasks to exchange information directly so as to improve one another. In particular, we demonstrate that more accurate energy data can improve the accuracy of structure prediction. We also find that consistency training can directly leverage force and off-equilibrium structure data to improve structure prediction, demonstrating a broad capability for integrating heterogeneous data.

Paper Structure

This paper contains 30 sections, 3 theorems, 15 equations, 4 figures, 14 tables, 2 algorithms.

Key Result

Proposition 1

Let $\mathbf{S}^{(\bm{\uptheta})}: \mathbb{R}^{A \times 3} \to \mathbb{R}^{A \times 3}$ be a rotationally equivariant function; that is, for any rotation matrix $\mathbf{Q} \in \mathrm{SO}(3)$ and structure $\mathbf{R} \in \mathbb{R}^{A \times 3}$, we have $\mathbf{S}^{(\bm{\uptheta})}(\mathbf{R} \m
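
The statement above is cut off in this copy, so the following is only a minimal sketch of what rotational equivariance means for a map on $A \times 3$ coordinate arrays. It assumes the common row-vector convention $\mathbf{S}(\mathbf{R}\mathbf{Q}) = \mathbf{S}(\mathbf{R})\mathbf{Q}$, which is an assumption rather than the paper's confirmed formula. The functions `toy_map`, `non_equivariant_map`, and `max_equivariance_error` are hypothetical illustrations and are not the model from the paper.

```python
import math
import random


def rotation_z(angle):
    """Rotation matrix about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_x(angle):
    """Rotation matrix about the x-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def matmul(a, b):
    """Plain-Python matrix product of nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def toy_map(coords):
    """Hypothetical equivariant map (NOT the paper's model).

    Each atom is assigned a vector pointing toward the centroid, scaled by a
    weight that depends only on its distance to the centroid.
    """
    n = len(coords)
    centroid = [sum(r[j] for r in coords) / n for j in range(3)]
    out = []
    for r in coords:
        d = [r[j] - centroid[j] for j in range(3)]
        weight = 1.0 / (1.0 + sum(x * x for x in d))  # rotation-invariant
        out.append([-weight * x for x in d])
    return out


def non_equivariant_map(coords):
    """Counterexample: shifting along a fixed lab-frame axis breaks equivariance."""
    return [[r[0] + 1.0, r[1], r[2]] for r in coords]


def max_equivariance_error(fn, coords, q):
    """Largest entry of |fn(R Q) - fn(R) Q| under the assumed convention."""
    lhs = fn(matmul(coords, q))
    rhs = matmul(fn(coords), q)
    return max(
        abs(lhs[i][j] - rhs[i][j]) for i in range(len(coords)) for j in range(3)
    )


rng = random.Random(0)
coords = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(5)]
q = matmul(rotation_z(0.7), rotation_x(-1.3))  # an arbitrary rotation in SO(3)

print("toy_map error:", max_equivariance_error(toy_map, coords, q))
print("counterexample error:", max_equivariance_error(non_equivariant_map, coords, q))
```

The first error should be at the level of floating-point round-off, while the second is clearly nonzero, because a fixed lab-frame shift does not rotate with the input.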

Figures (4)

  • Figure 1: Illustration of the idea of physical consistency. To support multiple tasks ("Task X" represents a general task), the model (blue solid lines) builds multiple decoders on a shared encoder, which are trained by multi-task learning with data of the respective tasks (green dotted double arrows). Physical consistency losses enforce physical laws between tasks (orange dashed double arrows), and hence bridge data heterogeneity and directly improve one task using the others.
  • Figure 2: Comparison of energy (eV) on the model-generated structure $\mathbf{R}_{\text{pred}}$ using the denoising method and the equilibrium structure $\mathbf{R}_{\text{eq}}$ in the PCQ dataset. Each point represents the model-predicted energy values on the two structures for one test molecule. Models are trained on (left) the PM6 dataset, (middle) the PM6 dataset and SPICE force dataset, and (right) the PM6 dataset with a subset of force labels. The closer a point lies to the diagonal line, the closer the energy of the predicted structure is to the minimum energy, indicating a closer prediction of equilibrium structure.
  • Figure C.1: Box plots for the distributions of energy prediction MAE (eV) evaluated on the PM6 structures of 200 randomly selected molecules from the intersection of the PM6 and PCQ datasets.
  • Figure C.2: Histogram showing the distribution of the proportion of similar (Tanimoto similarity > 0.7) molecules in PM6 (the training dataset) over the 200 PCQ test molecules. Note that the x-axis is scaled by $1 \times 10^{-6}$.

Theorems & Definitions (3)

  • Proposition 1
  • Lemma A.1
  • Lemma A.2