Towards Data-Efficient Pretraining for Atomic Property Prediction

Yasir Ghunaim, Hasan Abed Al Kader Hammoud, Bernard Ghanem

TL;DR

The paper asks whether scaling data and compute is the only path to progress in atomic property prediction. It proposes a data-efficient pretraining framework built on the Chemical Similarity Index (CSI), an FID-inspired metric that quantifies the alignment between upstream and downstream datasets and guides dataset selection. Empirically, pretraining on a single high-quality upstream dataset selected by CSI often matches or outperforms large, mixed pretraining at a fraction of the cost, while indiscriminately adding data can harm performance. These findings offer a practical alternative to escalating data and compute, and they highlight the importance of dataset relevance for downstream molecular predictions.

Abstract

This paper challenges the recent paradigm in atomic property prediction that links progress to growing dataset sizes and computational resources. We show that pretraining on a carefully selected, task-relevant dataset can match or even surpass large-scale pretraining, while using as little as 1/24th of the computational cost. We introduce the Chemical Similarity Index (CSI), a novel metric for molecular graphs, inspired by computer vision's Fréchet Inception Distance, that quantifies the alignment between upstream pretraining datasets and downstream tasks. By selecting the most relevant dataset, the one with minimal CSI distance, we show that models pretrained on a smaller, focused dataset consistently outperform those pretrained on massive, mixed datasets such as JMP, even when those larger datasets include the relevant dataset. Counterintuitively, we also find that indiscriminately adding more data can degrade model performance when the additional data aligns poorly with the task at hand. Our findings highlight that quality often outperforms quantity in pretraining for atomic property prediction.

Paper Structure

This paper contains 24 sections, 8 equations, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: Pretraining on a High-Quality, Task-Relevant Dataset. Pretraining on a carefully selected high-quality dataset achieves comparable or superior mean absolute error (MAE) across tasks while reducing computational cost by a factor of 24 compared to JMP-S, which is pretrained on all upstream datasets. Lower MAE indicates better performance.
  • Figure 2: Pipeline Overview. Our paradigm for pretraining and finetuning consists of two new components: (1) Dataset Selection Stage, where a distance metric $\delta$ is employed to identify the dataset that is most similar to our downstream task dataset $\mathcal{D}_d$, in this case $\mathcal{D}_u^{(1)}$. This selected dataset is then used for pretraining the model. (2) Limited Budget Pretraining, where we impose a training budget by subsampling $\mathcal{N}$ random samples from $\mathcal{D}_u^{(1)}$ and training the model for $\mathcal{E}$ epochs. This results in a computational budget of $\mathcal{C} = \mathcal{E} \times \mathcal{N}$. The pretrained backbone $\theta_b^{(1)*}$ is subsequently finetuned on the downstream task dataset $\mathcal{D}_d$ to obtain the final model parameters $\theta_d^*$.
  • Figure 3: Alignment Between Upstream and Downstream Using CSI. We assess how well the extracted representations from each upstream dataset align with downstream tasks using our CSI metric, where lower values indicate stronger alignment. ANI-1x demonstrates the closest feature alignment with downstream tasks, whereas OC20 and OC22 show the weakest alignment.
  • Figure 4: Impact of Adding Less Relevant Pretraining Data. Adding 1M OC22 samples to a 2M-sample ANI-1x baseline worsens downstream performance despite a larger pretraining budget. This highlights the importance of dataset relevance and the CSI metric for effective pretraining.
  • Figure 5: CSI Between Upstream and OOD Downstream Tasks. CSI values predict that ANI-1x is the best pretraining choice for QMOF, while OC20 and OC22 are best for MatBench.
  • ...and 4 more figures
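
The selection and budgeting steps described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration, not the paper's CSI implementation: it assumes the distance $\delta$ is the standard Fréchet distance between Gaussian fits of feature embeddings (the form used by FID), and it assumes those embeddings are already available as NumPy arrays. How CSI extracts and samples features from molecular graphs is not covered here. The names `frechet_distance`, `select_upstream`, `compute_budget`, `upstream_a`, and `upstream_b` are hypothetical, and the data is synthetic.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Squared Frechet distance between Gaussian fits of two feature sets.

    Each input has shape (num_samples, feature_dim). This is the FID-style
    formula, used here as a stand-in for the paper's CSI (an assumption).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs; clip round-off negatives to zero.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


def select_upstream(upstream_feats: dict, downstream_feats: np.ndarray) -> str:
    """Return the name of the upstream dataset closest to the downstream one."""
    distances = {
        name: frechet_distance(feats, downstream_feats)
        for name, feats in upstream_feats.items()
    }
    return min(distances, key=distances.get)


def compute_budget(num_epochs: int, num_samples: int) -> int:
    """Budget C = E x N: total samples seen during pretraining."""
    return num_epochs * num_samples


# Synthetic demo: two hypothetical upstream feature sets, one close to the
# downstream distribution and one far from it.
rng = np.random.default_rng(0)
downstream = rng.normal(loc=0.0, size=(500, 8))
upstream = {
    "upstream_a": rng.normal(loc=0.1, size=(500, 8)),  # small mean shift
    "upstream_b": rng.normal(loc=3.0, size=(500, 8)),  # large mean shift
}
chosen = select_upstream(upstream, downstream)
budget = compute_budget(num_epochs=5, num_samples=2_000_000)
print(chosen, budget)
```

The eigenvalue route to the matrix square root trace keeps the sketch dependent on NumPy alone; a production version would likely also guard against ill-conditioned covariance estimates when the number of samples is small relative to the feature dimension. The epoch and sample counts in the demo are placeholders, not values taken from the paper.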