Transformers for molecular property prediction: Domain adaptation efficiently improves performance

Afnan Sultan, Max Rausch-Dupont, Shahrukh Khan, Olga Kalinina, Dietrich Klakow, Andrea Volkamer

TL;DR

This study questions the value of indiscriminately scaling pre-training data for molecular property prediction and shows that large unlabeled corpora yield diminishing returns. It demonstrates that domain adaptation on small, domain-specific data using a multi-task regression of physicochemical properties substantially improves downstream ADME predictions, outperforming larger models in many cases. The best-performing setup combines MLM pre-training with MTR-based domain adaptation (the MLM_MTR configuration), achieving competitive results relative to MolBERT and MolFormer while using far fewer pre-training molecules. Overall, incorporating chemically informed objectives and domain-aligned data is shown to be a more effective and efficient path for molecular property prediction than mere increases in pre-training size, with explicit physicochemical features continuing to provide strong signal for downstream tasks.

Abstract

Over the past six years, molecular transformer models have become key tools in drug discovery. Most existing models are pre-trained on large, unlabeled datasets such as ZINC or ChEMBL. However, the extent to which large-scale pre-training improves molecular property prediction remains unclear. This study evaluates transformer models for this task while addressing their limitations. We explore how pre-training dataset size and chemically informed objectives impact performance. Our results show that increasing the pre-training dataset beyond approximately 400K to 800K molecules from large-scale unlabeled databases does not enhance performance across seven datasets covering five ADME endpoints: lipophilicity, permeability, solubility (two datasets), microsomal stability (two datasets), and plasma protein binding. In contrast, domain adaptation on a small, domain-specific dataset ($\leq 4$K molecules) using multi-task regression of physicochemical properties significantly boosts performance ($p < 0.001$). A model pre-trained on 400K molecules and adapted with domain-specific data outperforms larger models such as MolFormer and performs comparably to MolBERT. Benchmarks against Random Forest (RF) baselines using descriptors and Morgan fingerprints show that chemically and physically informed features consistently yield better performance across model types. While RF remains a strong baseline, we identify concrete practices to enhance transformer performance. Aligning pre-training and adaptation with chemically meaningful tasks and domain-relevant data presents a promising direction for molecular property prediction. Our models are available on HuggingFace for easy use and adaptation.
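
To make the multi-task regression (MTR) objective mentioned in the abstract more concrete, one plausible form is a mean squared error averaged over molecules and physicochemical properties. This is only an illustrative sketch, not necessarily the loss used in the paper: the symbols $N$ (number of molecules in the adaptation set), $K$ (number of physicochemical properties), $y_{ik}$ (the standardized value of property $k$ for molecule $i$), and $\hat{y}_{ik}$ (the model's prediction) are all assumptions introduced here.

```latex
\mathcal{L}_{\mathrm{MTR}}
  = \frac{1}{N K} \sum_{i=1}^{N} \sum_{k=1}^{K}
    \left( \hat{y}_{ik} - y_{ik} \right)^{2}
```

Under this assumed form, every property contributes equally to the loss, which is why the targets would typically be standardized first.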

Paper Structure

This paper contains 21 sections, 3 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: Data summary and distributions of the seven investigated ADME endpoints. A boxplot is shown with each histogram to highlight the quartiles and the datapoints that would be considered outliers. Plots with two distributions indicate the presence of censored labels, which cannot be used directly in a regression model. In those plots, the two distributions show the original data (blue) and the data actually used during evaluation (orange).
  • Figure 2: An overview of this research's workflow. Transformer models are trained by pre-training on generic large unlabeled datasets using one or more objectives (step 1), followed by fine-tuning on labeled datasets (step 2). Domain adaptation is an optional intermediate step that resembles pre-training but can be done on a much smaller unlabeled dataset (step 1.5).
  • Figure 3: MAE performance for increasing pre-training dataset sizes using Butina splitting. 0% corresponds to a randomly initialized model with no pre-training and 100% corresponds to the $\sim 1.3$M molecules of the GuacaMol dataset. Because two-tailed significance analyses were performed, the arrows in the heatmap help identify the model with the improved performance. CI = confidence interval for the estimation of the mean.
  • Figure 4: $R^2$ performance for increasing pre-training dataset sizes using Butina splitting. 0% corresponds to a randomly initialized model with no pre-training and 100% corresponds to the $\sim 1.3$M molecules of the GuacaMol dataset. Because two-tailed significance analyses were performed, the arrows in the heatmap help identify the model with the improved performance. CI = confidence interval for the estimation of the mean.
  • Figure 5: MAE performance for a baseline model trained with pre-training only (No DA), and three models incorporating domain adaptation (DA) using different objectives: Masked Language Modeling (MLM), Contrastive Learning (CL), and Multi-task Regression (MTR) for physicochemical properties. P-values are from one-tailed paired t-tests comparing each DA model to the No DA baseline, under the hypothesis that DA improves performance. Significance levels: * $p < 0.05$, ** $p < 0.01$, *** $p < 0.001$.
  • ...and 10 more figures
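
The figure captions above rely on three quantities: MAE, $R^2$, and a paired t-test comparing a domain-adapted model with a baseline. The following is a minimal standard-library sketch of how these could be computed, not the authors' evaluation code. The function names and the example error values are hypothetical. Only the t statistic is computed here; obtaining the one-tailed p-value additionally requires the t distribution with $n - 1$ degrees of freedom (for instance via `scipy.stats.ttest_rel` with `alternative="less"`, assuming SciPy is installed).

```python
import math
from statistics import mean, stdev


def mae(y_true, y_pred):
    """Mean absolute error between labels and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def paired_t_statistic(baseline_errors, da_errors):
    """t statistic of the paired differences (DA minus baseline).

    A negative value means the domain-adapted model has lower error
    on average across the paired evaluation folds.
    """
    diffs = [d - b for b, d in zip(baseline_errors, da_errors)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical per-fold MAE values, for illustration only.
baseline = [0.50, 0.60, 0.70, 0.80]
adapted = [0.40, 0.45, 0.60, 0.65]
t_stat = paired_t_statistic(baseline, adapted)
```

Pairing matters because both models are evaluated on the same folds: the test looks at per-fold differences rather than at the two sets of errors independently.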