Deep Learning Foundation Models from Classical Molecular Descriptors
Jackson W. Burns, Akshat Shirish Zalte, Charlles R. A. Abreu, Jochen Sieg, Christian Feldmann, Miriam Mathea, William H. Green
TL;DR
CheMeleon introduces descriptor-based pre-training for a large D-MPNN foundation model, trained to predict classical Mordred descriptors from unlabeled PubChem molecules. By learning chemically informed representations through descriptor prediction and then fine-tuning on downstream tasks, CheMeleon achieves state-of-the-art performance on Polaris and MoleculeACE benchmarks, significantly outperforming traditional baselines and other foundation models. kNN probing further shows the learned embeddings encode chemically meaningful relationships, supporting robust generalization across diverse chemical spaces. The approach offers a practical, open-source pathway to leverage deterministic, low-noise descriptors for foundation-model pre-training in chemistry, avoiding reliance on noisy experimental data or expensive quantum simulations.
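The kNN probing idea can be illustrated with a minimal sketch: if embeddings place chemically similar molecules close together, a property of a query molecule should be predictable from its nearest neighbors in embedding space. The function name `knn_probe`, the toy 2-D vectors, and the property values below are hypothetical and are not taken from the paper; the actual probing protocol may differ.

```python
import math


def knn_probe(train_emb, train_y, query_emb, k=3):
    """Predict a property as the mean over the k nearest training embeddings."""
    # Pair each training label with its Euclidean distance to the query.
    by_distance = sorted(
        (math.dist(emb, query_emb), y) for emb, y in zip(train_emb, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical 2-D "embeddings" forming two clusters with distinct property values.
train_emb = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
train_y = [1.0, 1.2, 0.8, 4.0, 4.2, 3.8]

pred_low = knn_probe(train_emb, train_y, (0.05, 0.05), k=3)
pred_high = knn_probe(train_emb, train_y, (5.05, 5.05), k=3)
```

In a real probe, `train_emb` would hold fixed embeddings from the frozen pre-trained model, so the probe's accuracy reflects the structure of the embedding space rather than any further training.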
Abstract
Fast and accurate data-driven prediction of molecular properties is pivotal to scientific advancements across myriad chemical domains. Deep learning methods have recently garnered much attention, despite their inability to outperform classical machine learning methods when tested on practical, real-world benchmarks with limited training data. This study seeks to bridge this gap with CheMeleon, an O(10M)-parameter foundation model that enables directed message-passing neural networks to finally exceed the performance of classical methods. Evaluated on 58 benchmark datasets from Polaris and MoleculeACE, CheMeleon achieves a win rate of 75% on Polaris tasks, outperforming baselines like Random Forest (68%), fastprop (36%), and Chemprop (32%), and a 97% win rate on MoleculeACE assays, surpassing Random Forest (50%) and other foundation models. Unlike conventional pre-training approaches that rely on noisy experimental data or biased quantum mechanical simulations, CheMeleon utilizes low-noise molecular descriptors to learn rich and highly transferable molecular representations, suggesting a new avenue for foundation model pre-training.
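The abstract does not define win rate, so the following is only a sketch under the simplest reading: a model wins a task when its score is the best, or tied for the best, among all models on that task. The reported rates sum to more than 100% across models, which suggests that ties or statistically equivalent results count as wins for several models at once; that is an inference, not a statement from the text. The function `win_rate`, the task names, and all scores are hypothetical.

```python
def win_rate(scores, model, higher_is_better=True):
    """Fraction of tasks on which `model` has the best score (ties count as wins)."""
    wins = 0
    for task_scores in scores.values():
        values = task_scores.values()
        best = max(values) if higher_is_better else min(values)
        if task_scores[model] == best:
            wins += 1
    return wins / len(scores)


# Hypothetical per-task scores (higher is better); not results from the paper.
scores = {
    "task_1": {"model_a": 0.90, "model_b": 0.85},
    "task_2": {"model_a": 0.70, "model_b": 0.70},  # tie: both models win
    "task_3": {"model_a": 0.60, "model_b": 0.65},
    "task_4": {"model_a": 0.80, "model_b": 0.75},
}

rate_a = win_rate(scores, "model_a")
rate_b = win_rate(scores, "model_b")
```

Because the tie on `task_2` counts for both models, the two rates here add up to more than 1. A stricter variant would replace the equality check with a statistical test over repeated splits.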
