Pre-training with Fractional Denoising to Enhance Molecular Property Prediction

Yuyan Ni; Shikun Feng; Xin Hong; Yuancheng Sun; Wei-Ying Ma; Zhi-Ming Ma; Qiwei Ye; Yanyan Lan

TL;DR

This work addresses data scarcity in molecular property prediction by proposing Fractional Denoising (Frad), a physics-informed self-supervised pre-training framework that decouples noise design from the constraints of force-learning equivalence. Frad combines a customizable Chemical-Aware Noise (CAN) with Coordinate Gaussian Noise (CGN), so that denoising keeps its force-learning interpretation while the two CAN variants, rotation noise (RN) and vibration and rotation noise (VRN), explore a broader region of the energy surface. The authors prove that Frad is equivalent to learning atomic forces under an approximate Boltzmann distribution and report state-of-the-art results on atom-wise force prediction, molecule-wise QM9 properties, and complex-wise binding affinity tasks, along with robustness to inaccurate conformations. Overall, Frad provides a physically grounded pre-training framework for molecular representation learning that improves force accuracy, sampling coverage, and downstream predictive performance.

Abstract

Deep learning methods have been considered promising for accelerating molecular screening in drug discovery and material design. Due to the limited availability of labelled data, various self-supervised molecular pre-training methods have been presented. While many existing methods utilize common pre-training tasks in computer vision (CV) and natural language processing (NLP), they often overlook the fundamental physical principles governing molecules. In contrast, applying denoising in pre-training can be interpreted as an equivalent force learning, but the limited noise distribution introduces bias into the molecular distribution. To address this issue, we introduce a molecular pre-training framework called fractional denoising (Frad), which decouples noise design from the constraints imposed by force learning equivalence. In this way, the noise becomes customizable, allowing for incorporating chemical priors to significantly improve molecular distribution modeling. Experiments demonstrate that our framework consistently outperforms existing methods, establishing state-of-the-art results across force prediction, quantum chemical properties, and binding affinity tasks. The refined noise design enhances force accuracy and sampling coverage, which contribute to the creation of physically consistent molecular representations, ultimately leading to superior predictive performance.

Paper Structure

This paper contains 37 sections, 9 theorems, 37 equations, 7 figures, 19 tables, 3 algorithms.

Introduction
Results
Frad framework
Atomic forces learning interpretation
Chemical-aware noise design
Frad boosts the performances of property prediction
Atom-wise force prediction
Molecule-wise quantum chemical properties prediction
Complex-wise binding affinity prediction
Frad is robust to inaccurate conformations
Comparisons to coordinate denoising methods
Frad achieves higher force accuracy
Frad can sample farther from the equilibriums
Conclusion
Methods
...and 22 more sections

Key Result

Theorem 1

If the hybrid noise satisfies $p(x_{fin}|x_{med})\sim \mathcal{N}(x_{med},\tau^2 I_{3N})$, that is, the noise applied in the second phase is a coordinate Gaussian noise (CGN), then fractional denoising is equivalent to learning the atomic forces that correspond to the approximate molecular distribution given by the Boltzmann distribution. Here $\simeq$ denotes equivalence as optimization objectives. The fractional denoising target in the theorem is the optimizat…
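The reasoning behind Theorem 1 can be sketched with the standard denoising score matching argument (Lemma I in the theorem list). The block below is a sketch under stated assumptions, not the paper's own derivation: $g_\theta$ (the noise-predicting network), $s_\theta$ (a score model), $E$ (the energy associated with the noisy distribution), $k_B T$, and the constant $C$ are notation introduced here for illustration. Only $x_{fin}$, $x_{med}$, and $\tau$ come from the theorem statement.

```latex
% Assumption: p(x_{fin}) follows a Boltzmann form, p(x_{fin}) \propto \exp(-E(x_{fin}) / k_B T).
\begin{align}
% Score of the Gaussian second-stage noise (CGN):
\nabla_{x_{fin}} \log p(x_{fin} \mid x_{med}) &= -\frac{x_{fin} - x_{med}}{\tau^{2}} \\
% Score of the marginal under the Boltzmann assumption:
\nabla_{x_{fin}} \log p(x_{fin}) &= -\frac{1}{k_B T}\, \nabla_{x_{fin}} E(x_{fin}) \\
% Score matching vs. conditional score matching (C does not depend on \theta):
\mathbb{E}_{p(x_{fin}, x_{med})} \big\| s_\theta(x_{fin}) - \nabla_{x_{fin}} \log p(x_{fin} \mid x_{med}) \big\|^{2}
  &= \mathbb{E}_{p(x_{fin})} \big\| s_\theta(x_{fin}) - \nabla_{x_{fin}} \log p(x_{fin}) \big\|^{2} + C \\
% Substituting s_\theta = -g_\theta / \tau^{2} and multiplying both sides by \tau^{4}:
\mathbb{E}_{p(x_{fin}, x_{med})} \big\| g_\theta(x_{fin}) - (x_{fin} - x_{med}) \big\|^{2}
  &= \mathbb{E}_{p(x_{fin})} \Big\| g_\theta(x_{fin}) - \frac{\tau^{2}}{k_B T}\, \nabla_{x_{fin}} E(x_{fin}) \Big\|^{2} + \tau^{4} C
\end{align}
```

Under these assumptions, predicting the CGN component $x_{fin} - x_{med}$ has the same minimizer as regressing onto $\nabla_{x_{fin}} E$, the negative atomic force, up to the constant factor $\tau^{2}/(k_B T)$. The distribution of $x_{med}$ never enters the argument, which is why the first-stage noise can be chosen freely.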

Figures (7)

Figure 1: Overview of fractional denoising (Frad). a. An illustration of molecular conformational changes. Local structures can vibrate on a small scale, while some single bonds can rotate flexibly. b. The noise-adding process in the Frad framework. A two-phase hybrid random noise is applied to the equilibrium conformation: the chemical-aware noise (CAN), which describes molecular conformational changes, followed by coordinate Gaussian noise (CGN). We present two versions of CAN. c. Pre-training process of Frad. The unlabeled molecular data is processed by adding noise and then used as the input of the graph neural network (GNN), which predicts the CGN. This task is proved to be equivalent to learning the approximate atomic forces in the molecule. d. Fine-tuning process of Frad. The GNN model inherits the pre-trained weights and continues to be updated together with a prediction head for specific downstream tasks. e. Advancements of Frad over coordinate denoising methods (Coord) [SheheryarZaidi2022PretrainingVD, coordnoneq2023, ShengchaoLiu2022MolecularGP, jiao2022energy, luo2022one, feng2023may] from the perspective of chemical priors and physical interpretations. The noise of Frad is customizable, which allows it to capture both rotations and vibrations in molecular conformational changes. Frad's superior modeling of the molecular distribution further enables larger sampling coverage and more accurate force targets in the equivalent force learning task, resulting in effective pre-training and improved downstream performance. f. An illustration of the model architecture. The model primarily follows the TorchMD-NET framework, with our minor modifications highlighted in dotted orange boxes.
Figure 2: Test performance of different models on force prediction tasks in the MD17 dataset. To compare fairly with existing methods, we assess Frad under two split settings commonly found in the literature. The top row shows results using a large training data split and the bottom row shows results with limited training data.
Extended Data Fig. 1: Test performance of Coord and Frad with different data accuracy on 3 tasks in QM9. "Train from scratch" refers to the backbone model TorchMD-NET without pre-training. Both Coord and Frad use the TorchMD-NET backbone. "RDKit" and "DFT" refer to pre-training on molecular conformations generated by RDKit and DFT methods respectively. Frad is more robust to inaccurate pre-training conformations than Coord.
Extended Data Fig. 2: Estimated force accuracy of Frad and Coord in three different sampling settings. Force accuracy is measured by the Pearson correlation coefficient $\rho$ between the force estimation and the ground truth. The estimation error is denoted as $C_{error}$. Within the range of small estimation errors, the force estimated by Frad ($\sigma>0$) is consistently more accurate than that estimated by Coord ($\sigma=0$).
Extended Data Fig. 3: Performance of Coord and Frad with different perturbation scales. The performance (MAE) is averaged across seven energy prediction tasks in QM9. The perturbation scale refers to the mean absolute coordinate change resulting from noise application and is determined by the noise scale. For Frad, the hyperparameter is fixed at $\tau=0.04$. Frad can effectively sample farther from the equilibrium without a significant performance drop.
...and 2 more figures
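
The noise-adding process in the Figure 1 caption and the perturbation scale in the Extended Data Fig. 3 caption can be illustrated with a minimal NumPy sketch. It is an assumption-laden toy, not the authors' implementation. The function names (`rotate_about_bond`, `frad_sample`, `perturbation_scale`), the four-atom geometry, and the value of `sigma` are hypothetical. The stand-in for CAN is a single Gaussian-distributed rotation about one bond, which is much simpler than the chemical-aware noise described in the paper. Only the two-phase structure (CAN, then CGN, with the CGN as the prediction target) and $\tau=0.04$ come from the text.

```python
import numpy as np


def rotate_about_bond(coords, i, j, moving, angle):
    """Rotate the atoms listed in `moving` by `angle` (radians) about the
    axis through atoms i and j, using Rodrigues' rotation formula."""
    axis = coords[j] - coords[i]
    axis = axis / np.linalg.norm(axis)
    rel = coords[moving] - coords[j]          # positions relative to a point on the axis
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    rotated = (rel * cos_a
               + np.cross(axis, rel) * sin_a
               + np.outer(rel @ axis, axis) * (1.0 - cos_a))
    out = coords.copy()
    out[moving] = rotated + coords[j]
    return out


def frad_sample(x_eq, bond, moving, sigma, tau, rng):
    """Two-phase hybrid noise: a toy rotation noise (stand-in for CAN),
    then coordinate Gaussian noise (CGN). Returns (x_med, x_fin, cgn)."""
    angle = rng.normal(0.0, sigma)            # phase 1: hypothetical simplified CAN
    x_med = rotate_about_bond(x_eq, bond[0], bond[1], moving, angle)
    cgn = rng.normal(0.0, tau, size=x_eq.shape)  # phase 2: CGN
    x_fin = x_med + cgn
    return x_med, x_fin, cgn                  # cgn is the denoising target


def perturbation_scale(x_eq, x_noisy):
    """Mean absolute coordinate change caused by the noise."""
    return float(np.mean(np.abs(x_noisy - x_eq)))


# Toy four-atom chain; atom 3 rotates about the bond between atoms 1 and 2.
x_eq = np.array([[-1.0, 1.0, 0.0],
                 [0.0, 0.0, 0.0],
                 [1.5, 0.0, 0.0],
                 [2.5, 1.0, 0.0]])
rng = np.random.default_rng(0)
x_med, x_fin, target = frad_sample(x_eq, bond=(1, 2), moving=[3],
                                   sigma=0.5, tau=0.04, rng=rng)
print("perturbation scale:", perturbation_scale(x_eq, x_fin))
```

In this sketch, setting `sigma=0` leaves `x_med` equal to `x_eq`, so the sample reduces to plain coordinate denoising, which mirrors the Coord ($\sigma=0$) versus Frad ($\sigma>0$) comparison in the Extended Data Fig. 2 caption. A pre-training loop would feed `x_fin` to the network and regress its output onto `target`.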

Theorems & Definitions (18)

Theorem 1: Learning Forces via Fractional Denoising
Theorem 2: Learning Forces via Coordinate Denoising [SheheryarZaidi2022PretrainingVD]
proof
proof
Lemma I: The equivalence between score matching and conditional score matching [PascalVincent2011ACB]
proof
Theorem 3: Noise type transformation
proof
Theorem 4
proof
...and 8 more
