Pre-training via Denoising for Molecular Property Prediction

Sheheryar Zaidi; Michael Schaarschmidt; James Martens; Hyunjik Kim; Yee Whye Teh; Alvaro Sanchez-Gonzalez; Peter Battaglia; Razvan Pascanu; Jonathan Godwin

TL;DR

Problem: molecular property prediction from 3D structures suffers from limited labeled data. Approach: self-supervised pre-training by denoising large datasets of equilibrium 3D structures, linked to score-matching and force-field learning, and applied to GNS and GNS-TAT. Contributions: (i) a denoising pre-training objective, (ii) a force-field interpretation of that objective, (iii) state-of-the-art results on the majority of QM9 targets, and (iv) cross-architecture validation and analyses of data and model factors. Impact: demonstrates a practical way to leverage large structural corpora for 3D molecular tasks and suggests broader applicability.

Abstract

Many important problems involving molecular property prediction from 3D structures have limited data, posing a generalization challenge for neural networks. In this paper, we describe a pre-training technique based on denoising that achieves a new state-of-the-art in molecular property prediction by utilizing large datasets of 3D molecular structures at equilibrium to learn meaningful representations for downstream tasks. Relying on the well-known link between denoising autoencoders and score-matching, we show that the denoising objective corresponds to learning a molecular force field -- arising from approximating the Boltzmann distribution with a mixture of Gaussians -- directly from equilibrium structures. Our experiments demonstrate that using this pre-training objective significantly improves performance on multiple benchmarks, achieving a new state-of-the-art on the majority of targets in the widely used QM9 dataset. Our analysis then provides practical insights into the effects of different factors -- dataset sizes, model size and architecture, and the choice of upstream and downstream datasets -- on pre-training.
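
To make the denoising objective described in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes i.i.d. Gaussian noise of scale `sigma` on atom positions and a model that predicts the unit-scale noise; the names `add_noise`, `denoising_loss`, `implied_force`, and the `predict_noise` callable are hypothetical stand-ins for a GNN such as GNS-TAT. Under the Gaussian-mixture view, the predicted noise divided by `-sigma` can be read as an approximate force, up to a constant scale.

```python
import numpy as np


def add_noise(positions, sigma, rng):
    """Perturb equilibrium atom positions (n_atoms, 3) with Gaussian noise.

    Returns the noisy positions and the unit-scale noise that was applied.
    """
    noise = rng.normal(size=positions.shape)
    return positions + sigma * noise, noise


def denoising_loss(predict_noise, positions, sigma, rng):
    """Squared error between predicted and true noise, averaged over atoms.

    `predict_noise` is a hypothetical stand-in for the network: it maps
    noisy positions (n_atoms, 3) to a per-atom noise estimate (n_atoms, 3).
    """
    noisy, noise = add_noise(positions, sigma, rng)
    pred = predict_noise(noisy)
    return float(np.mean(np.sum((pred - noise) ** 2, axis=-1)))


def implied_force(predict_noise, noisy_positions, sigma):
    """Approximate force implied by the denoiser (assumption: Gaussian noise).

    For x_noisy = x + sigma * eps, the conditional score is -eps / sigma,
    so a noise prediction can be rescaled into a force-like vector field.
    """
    return -predict_noise(noisy_positions) / sigma


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_eq = rng.normal(size=(5, 3))  # toy "equilibrium" structure, 5 atoms
    sigma = 0.1

    # An untrained stand-in model that always predicts zero noise.
    zero_model = lambda noisy: np.zeros_like(noisy)
    print("loss of zero predictor:", denoising_loss(zero_model, x_eq, sigma, rng))
```

In pre-training, this loss would be minimized over many equilibrium structures before fine-tuning the network on a labeled downstream task.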

Paper Structure

This paper contains 33 sections, 1 theorem, 11 equations, 7 figures, 10 tables.

Introduction
Related Work
Methodology
Problem Setup
Pre-training via Denoising
Denoising as Learning a Force Field
Noisy Nodes: Denoising as an Auxiliary Loss
GNS and GNS-TAT
Experiments
Datasets and Training Setup
Results on QM9
Results on OC20
Results on DES15K
Analysis
Pre-training a Different Architecture
...and 18 more sections

Key Result

Proposition 1

The minimization objectives $J_1(\theta)$ and $J_2(\theta)$ are equivalent.
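
This summary does not reproduce the definitions of $J_1$ and $J_2$. As a hedged reading, assuming they follow the standard explicit and denoising score-matching objectives from Vincent's connection between denoising autoencoders and score matching, they could be written as below. The symbols $p(x)$, $q_\sigma$, and $\mathrm{GNN}_\theta$ are assumed notation, not taken from the text above.

```latex
% Assumed notation: p(x) is the distribution of equilibrium structures,
% q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I) is the noising kernel,
% and q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x} \mid x)\, p(x)\, dx is the noised marginal.
\begin{align}
J_1(\theta) &= \mathbb{E}_{q_\sigma(\tilde{x})}
  \left[ \left\| \mathrm{GNN}_\theta(\tilde{x})
  - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) \right\|^2 \right], \\
J_2(\theta) &= \mathbb{E}_{p(x)\, q_\sigma(\tilde{x} \mid x)}
  \left[ \left\| \mathrm{GNN}_\theta(\tilde{x})
  - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \right\|^2 \right], \\
\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) &= \frac{x - \tilde{x}}{\sigma^2}.
\end{align}
```

Under this reading, "equivalent" means the two objectives differ only by a constant that does not depend on $\theta$, so they share the same minimizers. The second objective is the tractable one, because its regression target is the rescaled noise rather than the unknown score of the noised marginal.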

Figures (7)

Figure 1: GNS-TAT pre-trained via denoising on PCQM4Mv2 outperforms prior work on QM9.
Figure 2: Left: Frequency of compositions of molecules appearing in QM9 overlaid with the corresponding frequency in PCQM4Mv2. Each bar represents one molecular composition (e.g. one carbon atom, two oxygen atoms). Right: Percentage of elements appearing in QM9, DES15K, OC20 that also appear in PCQM4Mv2.
Figure 3: Left: Validation performance curves on the OC20 IS2RE task (ood_both split). See \ref{tab:oc20-baselines} for a comparison to other models in the literature. Right: Test performance curves for predicting interaction energies of dimer geometries in the DES15K dataset. "PT" and "NN" stand for pre-training and Noisy Nodes respectively.
Figure 4: Left: Impact of varying the downstream dataset size for the HOMO target in QM9 with GNS-TAT. Middle: Impact of varying the upstream dataset size for the HOMO target in QM9. Right: Validation performance curves on the OC20 S2EF task (ood_both split) for different model sizes. "PT" and "NN" stand for pre-training and Noisy Nodes respectively.
Figure 5: Training only the decoder results in significantly better performance when using pre-trained features rather than random ones.
...and 2 more figures

Theorems & Definitions (2)

Proposition 1: VincentConnection
proof
