Deep Learning Foundation Models from Classical Molecular Descriptors

Jackson W. Burns, Akshat Shirish Zalte, Charlles R. A. Abreu, Jochen Sieg, Christian Feldmann, Miriam Mathea, William H. Green

TL;DR

CheMeleon introduces descriptor-based pre-training for a large D-MPNN foundation model, trained to predict classical Mordred descriptors from unlabeled PubChem molecules. By learning chemically informed representations through descriptor prediction and then fine-tuning on downstream tasks, CheMeleon achieves state-of-the-art performance on Polaris and MoleculeACE benchmarks, significantly outperforming traditional baselines and other foundation models. kNN probing further shows the learned embeddings encode chemically meaningful relationships, supporting robust generalization across diverse chemical spaces. The approach offers a practical, open-source pathway to leverage deterministic, low-noise descriptors for foundation-model pre-training in chemistry, avoiding reliance on noisy experimental data or expensive quantum simulations.

Abstract

Fast and accurate data-driven prediction of molecular properties is pivotal to scientific advancements across myriad chemical domains. Deep learning methods have recently garnered much attention, despite their inability to outperform classical machine learning methods when tested on practical, real-world benchmarks with limited training data. This study seeks to bridge this gap with CheMeleon, an O(10M)-parameter foundation model that enables directed message-passing neural networks to finally exceed the performance of classical methods. Evaluated on 58 benchmark datasets from Polaris and MoleculeACE, CheMeleon achieves a win rate of 75% on Polaris tasks, outperforming baselines like Random Forest (68%), fastprop (36%), and Chemprop (32%), and a 97% win rate on MoleculeACE assays, surpassing Random Forest (50%) and other foundation models. Unlike conventional pre-training approaches that rely on noisy experimental data or biased quantum mechanical simulations, CheMeleon utilizes low-noise molecular descriptors to learn rich and highly transferable molecular representations, suggesting a new avenue for foundation model pre-training.
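
The win rates quoted above can be read as simple fractions of benchmarks. The sketch below is a minimal illustration, assuming that a "win" means the model was either the best performer or statistically indistinguishable from the best on a benchmark (the blue and gray categories in Figure 2). The label names, the `win_rate` helper, and the toy data are hypothetical and are not results from the paper.

```python
# Hypothetical outcome labels per benchmark, mirroring the Figure 2 color coding:
# "best" (blue), "tied" (gray, not practically different from the best), "worse" (red).
WINNING_OUTCOMES = {"best", "tied"}
ALL_OUTCOMES = WINNING_OUTCOMES | {"worse"}


def win_rate(outcomes):
    """Fraction of benchmarks on which a model is best or tied with the best."""
    if not outcomes:
        raise ValueError("need at least one benchmark outcome")
    unknown = set(outcomes) - ALL_OUTCOMES
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    wins = sum(1 for outcome in outcomes if outcome in WINNING_OUTCOMES)
    return wins / len(outcomes)


# Toy data for illustration only (assumed, not taken from the paper).
toy_outcomes = ["best", "tied", "worse", "best"]
print(f"{win_rate(toy_outcomes):.0%}")
```

Counting ties as wins matters here: under this assumed definition, several models can win the same benchmark, so the win rates of different models need not sum to 100%.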

Paper Structure

This paper contains 12 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Workflow for the present study. (a) A large corpus of unlabeled SMILES is randomly selected from PubChem [pubchem] and each is featurized into a vector of molecular descriptors using Mordred [mordred]. Chemprop is used to train a directed message passing neural network (D-MPNN) [chemprop_theory, chemprop_software] to predict these descriptors using a masked loss analogous to ChemBERTa's as a form of regularization [chemberta]. (b) The resulting D-MPNN is then reused for subsequent fine-tuning on smaller downstream datasets labeled with quantities of interest, such as bioactivity.
  • Figure 2: Performance of all of the tested models across a set of different common molecular machine learning tasks. The origin of each benchmark set is shown as the first line of each subplot title, followed by the name of the dataset (which indicates the task), the size of the training data, and the metric used to evaluate model performance. Benchmarks are sorted by training size in decreasing order. Models shown in blue are the absolute highest performers on the given benchmark, while models shown in gray are not practically different from the best performer according to the Tukey Honestly Significant Difference test ($\alpha=0.05$) based on the variance in test set performance across five repetitions, as laid out in Section \ref{benchmarks}. Models shown in red are practically worse performers and are considered to have "lost" on the indicated benchmark.
  • Figure 3: Performance of models across the ChEMBL assays [chembl] curated as part of the MoleculeACE study [moleculeace]. Each marker indicates the difference in Root Mean Squared Error (RMSE) of predictions between molecules in the cliff set and those not in the cliff set (noncliff), for the specified assay. Five-fold cross validation was performed according to the procedure described in Section \ref{statistical_comparisons}, enabling a one-sided t-test to check if this performance difference was statistically greater than zero (confidence interval for this test is shown as horizontal error bars, $\alpha=0.05$). Markers shown in blue are not practically different from zero, a positive result indicating that the model performance on the two sets is indistinguishable. Marker filling reflects the absolute performance of the models relative to one another following the same statistical procedure as Figure \ref{fig:polaris_hsd_selected}. Filled markers indicate that the given model was statistically the best or indistinguishable from the best performer in terms of RMSE for the entire test set; hollow markers indicate statistically significantly worse performance. Absolute overall performance results are also presented in Supporting Information Figure \ref{fig:mace_hsd_all} in the style of Figure \ref{fig:polaris_hsd_selected}.
  • Figure 4: kNN probing of fixed and learned molecule representations. Upper row shows average results over 20 ToxCast endpoints. Lower row shows the results on the NR-ER endpoint, which was also used in the comparison by [ball2020key]. The results were obtained with a 5-fold cross validation using a random split, representing the read-across scenario.
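
The masked loss mentioned in the Figure 1 caption can be illustrated with a small, framework-free sketch. It assumes the simplest reading of the idea: a random subset of descriptor targets is hidden at each training step, and the regression error is averaged only over the targets that remain visible. The function names, the squared-error form, and the 15% masking rate are assumptions for illustration; this section does not specify the actual rate or loss, and the real training is done with Chemprop rather than with code like this.

```python
import random


def random_mask(n_rows, n_cols, drop_fraction, rng):
    """Boolean mask; True marks a descriptor target that contributes to the loss."""
    return [
        [rng.random() >= drop_fraction for _ in range(n_cols)]
        for _ in range(n_rows)
    ]


def masked_mse(predictions, targets, mask):
    """Mean squared error averaged only over entries where the mask is True."""
    total, count = 0.0, 0
    for pred_row, target_row, mask_row in zip(predictions, targets, mask):
        for pred, target, keep in zip(pred_row, target_row, mask_row):
            if keep:
                total += (pred - target) ** 2
                count += 1
    # If every target is masked out, there is nothing to penalize.
    return total / count if count else 0.0


# Toy batch: 2 molecules x 3 descriptors (values are made up).
predictions = [[0.1, 0.9, 2.0], [1.5, 0.0, 0.3]]
targets = [[0.0, 1.0, 2.5], [1.0, 0.0, 0.0]]

rng = random.Random(0)
# Assumed masking rate of 15%; a fresh mask would be drawn at every step.
mask = random_mask(n_rows=2, n_cols=3, drop_fraction=0.15, rng=rng)
print(masked_mse(predictions, targets, mask))
```

Because a fresh mask is drawn at each step, the model never sees the full descriptor vector as a fixed target, which is the sense in which the caption describes the masking as a form of regularization.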