Pre-training via Denoising for Molecular Property Prediction
Sheheryar Zaidi, Michael Schaarschmidt, James Martens, Hyunjik Kim, Yee Whye Teh, Alvaro Sanchez-Gonzalez, Peter Battaglia, Razvan Pascanu, Jonathan Godwin
TL;DR
Problem: molecular property prediction from 3D structures suffers from limited labeled data. Approach: self-supervised pre-training by denoising large datasets of equilibrium 3D structures, linked to score-matching and force-field learning, and applied to GNS and GNS-TAT models. Contributions: (i) a denoising pre-training objective, (ii) an interpretation of that objective as learning a force field, (iii) state-of-the-art results on the majority of QM9 targets, and (iv) cross-architecture validation and analyses of data and model factors. Impact: demonstrates a practical way to exploit large structural corpora for 3D molecular tasks and suggests broader applicability.
Abstract
Many important problems involving molecular property prediction from 3D structures have limited data, posing a generalization challenge for neural networks. In this paper, we describe a pre-training technique based on denoising that achieves a new state-of-the-art in molecular property prediction by utilizing large datasets of 3D molecular structures at equilibrium to learn meaningful representations for downstream tasks. Relying on the well-known link between denoising autoencoders and score-matching, we show that the denoising objective corresponds to learning a molecular force field -- arising from approximating the Boltzmann distribution with a mixture of Gaussians -- directly from equilibrium structures. Our experiments demonstrate that using this pre-training objective significantly improves performance on multiple benchmarks, achieving a new state-of-the-art on the majority of targets in the widely used QM9 dataset. Our analysis then provides practical insights into the effects of different factors -- dataset sizes, model size and architecture, and the choice of upstream and downstream datasets -- on pre-training.
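
The link between denoising and force-field learning can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the noise scale `sigma`, the unweighted mean-squared-error loss, the 5-atom toy "molecule", and all function names are assumptions made for illustration. The sketch relies on two standard facts: under a Boltzmann distribution p(x) ∝ exp(-E(x)) the force -∇E(x) equals the score ∇ log p(x), and for a single Gaussian component centred at an equilibrium structure x0 the score at a noisy point x0 + sigma * eps is -eps / sigma. Predicting the added noise is therefore equivalent, up to scaling, to predicting the force field of the Gaussian approximation.

```python
import numpy as np

def add_noise(positions, sigma, rng):
    """Perturb atom positions with i.i.d. Gaussian noise of scale sigma."""
    eps = rng.standard_normal(positions.shape)
    return positions + sigma * eps, eps

def denoising_loss(predict_noise, positions, sigma, rng):
    """Mean squared error between predicted and actual noise (assumed loss form)."""
    noisy, eps = add_noise(positions, sigma, rng)
    return float(np.mean((predict_noise(noisy) - eps) ** 2))

def gaussian_score(noisy, center, sigma):
    """Gradient of log N(noisy; center, sigma^2 I) with respect to noisy."""
    return -(noisy - center) / sigma**2

rng = np.random.default_rng(0)
x0 = rng.standard_normal((5, 3))   # hypothetical 5-atom equilibrium structure
sigma = 0.1                        # assumed noise scale

noisy, eps = add_noise(x0, sigma, rng)
score = gaussian_score(noisy, x0, sigma)

# For one mixture component, the score is the negated, rescaled noise.
score_matches_noise = bool(np.allclose(score, -eps / sigma))

# An oracle that knows x0 recovers the noise exactly, so its loss is ~0.
# A real model would be a neural network that only sees the noisy structure.
def oracle(x):
    return (x - x0) / sigma

oracle_loss = denoising_loss(oracle, x0, sigma, rng)
```

In practice the oracle would be replaced by a learned network, and the mixture would contain one Gaussian per equilibrium structure in the pre-training dataset; the single-component case above only shows why noise prediction and score estimation coincide.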
