Equivariant Masked Position Prediction for Efficient Molecular Representation
Junyi An, Chao Qu, Yun-Fei Shi, XinHao Liu, Qianwei Tang, Fenglei Cao, Yuan Qi
TL;DR
The paper addresses the limited molecular data that hinders generalization in GNNs by introducing Equivariant Masked Position Prediction (EMPP), a self-supervised task that predicts a masked atom's position from the neighboring structure and thereby learns quantum-mechanical features without the Gaussian-mixture approximation used in denoising. EMPP combines SO(3)-equivariant backbones with spherical-harmonics-based distributions over direction and radius, which makes the position prediction well-posed and lets the model learn rich 3D structure and deterministic force-related information. It can be used as a pre-training task (e.g., on PCQM4Mv2) or as an auxiliary task when training for downstream quantum-property prediction, and it outperforms state-of-the-art masking and denoising methods on the QM9, MD17, and GEOM-Drug benchmarks. The approach yields substantial generalization gains, uses masking to generate additional training examples, and opens avenues for higher-order equivariant representations (e.g., $L_{max}>3$) in molecular modeling.
Abstract
Graph neural networks (GNNs) have shown considerable promise in computational chemistry. However, the limited availability of molecular data raises concerns about whether GNNs can effectively capture the fundamental principles of physics and chemistry, which constrains their generalization. To address this challenge, we introduce a novel self-supervised approach termed Equivariant Masked Position Prediction (EMPP), grounded in intramolecular potential and force theory. Unlike conventional attribute masking techniques, EMPP formulates a nuanced position prediction task that is better defined and enhances the learning of quantum mechanical features. EMPP also bypasses the approximation of the Gaussian mixture distribution commonly used in denoising methods, allowing for more accurate acquisition of physical properties. Experimental results indicate that EMPP significantly enhances the performance of advanced molecular architectures, surpassing state-of-the-art self-supervised approaches. Our code is available at https://github.com/ajy112/EMPP.
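
To make the masked position prediction task concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy three-atom geometry with made-up coordinates, and the helper names (`mask_atom`, `position_target`, `rotate_z`) are hypothetical. It hides one atom, expresses its position relative to a visible anchor atom as a radius and a unit direction, and checks the symmetry that an SO(3)-equivariant predictor would need to respect: rotating the molecule leaves the radius unchanged and rotates the direction.

```python
import math

def mask_atom(positions, masked_idx):
    """Split a molecule into visible atom positions and the hidden (masked) position."""
    visible = [p for i, p in enumerate(positions) if i != masked_idx]
    return visible, positions[masked_idx]

def position_target(anchor, masked_pos):
    """Express the masked position relative to an anchor atom as (radius, unit direction)."""
    diff = [m - a for m, a in zip(masked_pos, anchor)]
    radius = math.sqrt(sum(d * d for d in diff))
    direction = [d / radius for d in diff]
    return radius, direction

def rotate_z(p, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return [c * x - s * y, s * x + c * y, z]

# Toy three-atom geometry; coordinates are illustrative, not taken from the paper.
molecule = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]

# Hide the last atom and build the (radius, direction) target from the first visible atom.
visible, hidden = mask_atom(molecule, masked_idx=2)
radius, direction = position_target(visible[0], hidden)

# Rotate the whole molecule and rebuild the target.
angle = 0.7
rotated = [rotate_z(p, angle) for p in molecule]
visible_r, hidden_r = mask_atom(rotated, masked_idx=2)
radius_r, direction_r = position_target(visible_r[0], hidden_r)

# The radius is invariant; the direction rotates together with the molecule.
print(abs(radius - radius_r) < 1e-12)
print(all(abs(a - b) < 1e-12 for a, b in zip(direction_r, rotate_z(direction, angle))))
```

In this sketch the target is a single deterministic point. Per the summary above, EMPP instead predicts distributions over direction and radius using spherical harmonics; that part is not reproduced here.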
