Transformers for molecular property prediction: Domain adaptation efficiently improves performance
Afnan Sultan, Max Rausch-Dupont, Shahrukh Khan, Olga Kalinina, Dietrich Klakow, Andrea Volkamer
TL;DR
This study questions the value of indiscriminately scaling pre-training data for molecular property prediction and shows that large unlabeled corpora yield diminishing returns. It demonstrates that domain adaptation on small, domain-specific data, using multi-task regression (MTR) of physicochemical properties, substantially improves downstream ADME predictions and in many cases outperforms larger models. The best-performing setup combines masked language modeling (MLM) pre-training with MTR-based domain adaptation (the MLM_MTR configuration), achieving results competitive with MolBERT and MolFormer while using far fewer pre-training molecules. Overall, chemically informed objectives and domain-aligned data are shown to be a more effective and efficient path for molecular property prediction than simply increasing pre-training size, and explicit physicochemical features continue to provide a strong signal for downstream tasks.
Abstract
Over the past six years, molecular transformer models have become key tools in drug discovery. Most existing models are pre-trained on large, unlabeled datasets such as ZINC or ChEMBL. However, the extent to which large-scale pre-training improves molecular property prediction remains unclear. This study evaluates transformer models for this task while addressing their limitations. We explore how pre-training dataset size and chemically informed objectives impact performance. Our results show that increasing the pre-training dataset beyond approximately 400K to 800K molecules drawn from large-scale unlabeled databases does not improve performance across seven datasets covering five ADME endpoints: lipophilicity, permeability, solubility (two datasets), microsomal stability (two datasets), and plasma protein binding. In contrast, domain adaptation on a small, domain-specific dataset (≤4K molecules) using multi-task regression of physicochemical properties significantly boosts performance (p < 0.001). A model pre-trained on 400K molecules and adapted with domain-specific data outperforms larger models such as MolFormer and performs comparably to MolBERT. Benchmarks against Random Forest (RF) baselines using descriptors and Morgan fingerprints show that chemically and physically informed features consistently yield better performance across model types. While RF remains a strong baseline, we identify concrete practices to enhance transformer performance. Aligning pre-training and adaptation with chemically meaningful tasks and domain-relevant data presents a promising direction for molecular property prediction. Our models are available on HuggingFace for easy use and adaptation.
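To make the multi-task regression objective mentioned above more concrete, the following is a minimal sketch of how such a loss could be computed over several physicochemical properties at once. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss function or the target normalization, so the per-property z-scoring, the mean-squared-error loss, the helper names (standardize_columns, multitask_mse), and the property values are all hypothetical.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Z-score each property column (assumed normalization, not from the paper).

    Standardizing keeps properties with large numeric ranges from
    dominating the combined loss.
    """
    columns = list(zip(*rows))
    # Fall back to a scale of 1.0 if a column is constant.
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    scaled = [[(v - mu) / sd for v, (mu, sd) in zip(row, stats)] for row in rows]
    return scaled, stats


def multitask_mse(predictions, targets):
    """Return the MSE averaged over all property tasks, plus the per-task MSEs."""
    n_tasks = len(targets[0])
    per_task = []
    for t in range(n_tasks):
        errors = [(p[t] - y[t]) ** 2 for p, y in zip(predictions, targets)]
        per_task.append(mean(errors))
    return mean(per_task), per_task


# Made-up illustrative values for three molecules and three properties;
# these numbers are NOT taken from the study.
properties = [
    [180.0, 1.0, 60.0],
    [150.0, 0.5, 50.0],
    [210.0, 3.5, 40.0],
]
targets, stats = standardize_columns(properties)

# A placeholder "model" that always predicts the column mean,
# which is 0 in z-score space. A transformer's regression head
# would produce these predictions in a real setup.
predictions = [[0.0] * len(stats) for _ in targets]

loss, per_task = multitask_mse(predictions, targets)
print(f"multi-task loss: {loss:.3f}")
```

Because the targets are standardized, the mean-predicting placeholder yields a loss of about 1.0 on every task, which gives a simple reference point: a useful model should drive the per-task values below that.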
