Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization

Dingshuo Chen; Zhixun Li; Yuyan Ni; Guibin Zhang; Ding Wang; Qiang Liu; Shu Wu; Jeffrey Xu Yu; Liang Wang

TL;DR

MolPeg is a Molecular data Pruning framework for enhanced Generalization. It targets the source-free data pruning scenario, where data pruning is applied with pretrained models, and it consistently outperforms existing data pruning (DP) methods across four downstream tasks.

Abstract

With the emergence of various molecular tasks and massive datasets, how to perform efficient training has become an urgent yet under-explored issue in the area. Data pruning (DP), an oft-cited approach to reducing training burdens, filters out less influential samples to form a coreset for training. However, the increasing reliance on pretrained models for molecular tasks renders traditional in-domain DP methods incompatible. Therefore, we propose a Molecular data Pruning framework for enhanced Generalization (MolPeg), which focuses on the source-free data pruning scenario, where data pruning is applied with pretrained models. By maintaining two models with different updating paces during training, we introduce a novel scoring function to measure the informativeness of samples based on the loss discrepancy. As a plug-and-play framework, MolPeg realizes the perception of both source and target domains and consistently outperforms existing DP methods across four downstream tasks. Remarkably, it can surpass the performance obtained from full-dataset training, even when pruning up to 60-70% of the data on the HIV and PCBA datasets. Our work suggests that the discovery of effective data-pruning metrics could provide a viable path to both enhanced efficiency and superior generalization in transfer learning.
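The scoring idea in the abstract (two models updated at different paces, with samples scored by their loss discrepancy) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `ema_update`, `discrepancy_scores`, and `select_coreset` are hypothetical, the EMA convention and the `beta` and `keep_ratio` parameters are assumptions, and keeping the highest-scoring samples follows the Figure 2 caption's statement that samples with the largest scores form the coreset.

```python
from typing import List, Sequence


def ema_update(reference: Sequence[float], online: Sequence[float], beta: float) -> List[float]:
    """Move the slow reference parameters a fraction `beta` toward the online ones.

    Assumed convention: ref <- (1 - beta) * ref + beta * online.
    """
    return [(1.0 - beta) * r + beta * o for r, o in zip(reference, online)]


def discrepancy_scores(online_losses: Sequence[float], reference_losses: Sequence[float]) -> List[float]:
    """Score each sample by the absolute gap between its two per-sample losses."""
    return [abs(lo - lr) for lo, lr in zip(online_losses, reference_losses)]


def select_coreset(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the (sorted) indices of the highest-scoring samples."""
    n_keep = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy per-sample losses (made-up numbers, for illustration only).
online_losses = [0.90, 0.10, 0.50, 0.52, 1.40]
reference_losses = [0.30, 0.60, 0.50, 0.50, 1.35]

scores = discrepancy_scores(online_losses, reference_losses)
kept = select_coreset(scores, keep_ratio=0.4)  # keep 40% of the batch
print(kept)
```

In this toy batch the first two samples are kept: one is much harder for the online model than for the reference model, the other much easier, and both therefore have a large absolute discrepancy. Samples on which the two models roughly agree are pruned.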

Paper Structure (32 sections, 3 theorems, 19 equations, 6 figures, 4 tables, 1 algorithm)

Introduction
Preliminaries
Methodology
The MolPeg framework
Theoretical Understanding
Experimental Settings
Datasets and tasks
Implementation details
Empirical Studies
Empirical analysis on classification tasks
Results on QM9 dataset
Sensitivity Analysis
Conclusion
Datasets and Tasks
Computing infrastructures
...and 17 more sections

Key Result

Proposition 1

Under the paper's assumption, the loss discrepancy can be approximately expressed by the dot product between the data gradient and the "EMA gradient" ${\bm{v}}^{EMA}_t = \sum_{j=1}^t(1-\beta)^j \nabla_\theta{\mathcal{L}}(\hat{{\mathcal{D}}}_{t-j},\theta_{t-j})$, i.e. the weighted sum of the historical gradients.
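One way to see why such a relation is plausible is the following first-order sketch. It is not the paper's proof, and it rests on assumptions that do not come from this page: plain SGD with step size $\eta$ for the online parameters $\theta_t$, an EMA reference model $\xi_t$ updated as $\xi_t = (1-\beta)\xi_{t-1} + \beta\theta_t$ with $\xi_0 = \theta_0$, and a first-order Taylor expansion of the loss around $\theta_t$. The sign and the factor $\eta$ below follow from these conventions and may differ from the paper's exact statement.

```latex
% Assumptions (not taken from the source): plain SGD with step size \eta,
% EMA update \xi_t = (1-\beta)\xi_{t-1} + \beta\theta_t, and \xi_0 = \theta_0.
\begin{align*}
\theta_t &= \theta_{t-1} - \eta\, g_{t-1},
  \qquad g_{t-1} = \nabla_\theta \mathcal{L}(\hat{\mathcal{D}}_{t-1}, \theta_{t-1}) \\
\theta_t - \xi_t &= (1-\beta)(\theta_t - \xi_{t-1})
  = (1-\beta)\bigl[(\theta_{t-1} - \xi_{t-1}) - \eta\, g_{t-1}\bigr] \\
&= -\eta \sum_{j=1}^{t} (1-\beta)^j g_{t-j}
  = -\eta\, \bm{v}^{EMA}_t \\
\mathcal{L}(x, \theta_t) - \mathcal{L}(x, \xi_t)
  &\approx \nabla_\theta \mathcal{L}(x, \theta_t)^\top (\theta_t - \xi_t)
  = -\eta\, \nabla_\theta \mathcal{L}(x, \theta_t)^\top \bm{v}^{EMA}_t
\end{align*}
```

Under these assumptions, a sample receives a large absolute loss discrepancy when its own gradient is strongly aligned (positively or negatively) with the accumulated history of training gradients.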

Figures (6)

Figure 1: (Left) The performance comparison of different data pruning methods on the HIV dataset under the source-free data pruning setting. (Right) Distribution patterns of four important molecular features - molecular weight (MW), topological polar surface area (TPSA), Quantitative Estimate of Drug-likeness (QED) and number of bonds - in the PCQM4Mv2 and HIV datasets, which are used for pretraining and finetuning, respectively.
Figure 2: The overall framework of MolPeg. (Left) We maintain an online model and a reference model with different updating paces, which focus on the target and source domain, respectively. After model inference, the samples are scored by the absolute loss discrepancy and selected in ascending order. The easiest and hardest samples are given the largest score and selected to form the coreset. (Right) The selection process of MolPeg can be interpreted from a gradient projection perspective. Samples with low projection norms (grey) are discarded, while those with high norms are kept.
Figure 3: Performance comparison of selection criteria on the HIV dataset when pruning 40% of the samples.
Figure 4: Performance and efficiency comparison between different DP methods. Pretrained models are fine-tuned on the HIV dataset at a 60% pruning ratio.
Figure 5: Data pruning trajectory given by downstream performance (%). Here the source models are pretrained on the PCQM4Mv2 dataset with GraphMAE and GraphCL strategies, respectively.
...and 1 more figure

Theorems & Definitions (6)

Proposition 1: Interpretation of loss discrepancy
Proposition 2: Gradient projection interpretation of MolPeg, informal
Lemma 1
proof
proof
proof
