ChemFM as a Scaling Law Guided Foundation Model Pre-trained on Informative Chemicals

Feiyang Cai, Katelin Zacour, Tianyu Zhu, Tzuen-Rong Tzeng, Yongping Duan, Ling Liu, Srikanth Pilla, Gang Li, Feng Luo

TL;DR

ChemFM is a 3-billion-parameter chemical foundation model pre-trained on 178 million UniChem molecules with self-supervised causal language modeling. Scaling-law experiments identify UniChem as a more informative pre-training corpus than ZINC20. The model generalizes across 34 property-prediction datasets, conditional molecule generation, and reaction prediction, and it supports data-efficient fine-tuning via LoRA. Key findings include high validity and novelty in unconditional generation, substantial improvements over the state of the art on MoleculeNet and ADMET benchmarks, and competitive or superior performance on antibiotic discovery and USPTO reaction tasks. The work highlights the potential of a single, unified chemical foundation model to generalize across diverse chemistries with modest labeled data. The authors note limitations in exploration of chemical space and inference speed, and point to distillation and broader downstream tasks as directions for future work.

Abstract

Traditional AI methods often rely on task-specific model designs and training, which constrain both the scalability of model size and generalization across different tasks. Here, we introduce ChemFM, a large foundation model specifically developed for chemicals. By conducting a series of scaling experiments, we identify UniChem as the informative molecular database for pre-training the foundation model. ChemFM comprises 3 billion parameters and is pre-trained on 178 million molecules using self-supervised causal language modeling to extract generalizable molecular representations. This model can be adapted to diverse downstream chemical applications using either full-parameter or parameter-efficient fine-tuning methods. ChemFM consistently outperforms state-of-the-art task-specific AI models across all tested tasks. Notably, it achieves up to 67.48% performance improvement across 34 property prediction benchmarks, up to 33.80% reduction in mean average deviation between conditioned and actual properties of generated molecules in conditional molecular generation tasks, and up to 3.7% top-1 accuracy improvement across 4 reaction prediction datasets. Moreover, ChemFM demonstrates its superior performance in predicting antibiotic activity and cytotoxicity, highlighting its potential to advance the discovery of novel antibiotics. Furthermore, we demonstrate that, as a foundation model, ChemFM exhibits strong data efficiency, requiring significantly fewer labeled training samples to achieve state-of-the-art performance. We anticipate that ChemFM will significantly advance chemistry research by providing a foundation model capable of effectively generalizing across a broad range of tasks with minimal additional training.

Paper Structure

This paper contains 40 sections, 5 equations, 10 figures, 22 tables.

Figures (10)

  • Figure 1: Pre-training and unconditional molecular generation benchmarking of ChemFM models. a, Pre-processing pipeline for ChemFM's pre-training dataset. The pipeline starts with $178$ million molecules from the UniChem database, initially represented by the International Chemical Identifier (InChI). These InChIs are converted into canonical SMILES strings using RDKit. The SMILES strings are then augmented tenfold through the SMILES enumeration technique, resulting in approximately $1.78$ billion SMILES strings for use as the pre-training dataset. b, Pre-training process for ChemFM. SMILES strings are segmented, tokenized, and terminated with an end token. These tokens are fed into ChemFM, a causal decoder-only transformer. Pre-training uses self-supervised causal language modeling, where the task is to predict each token based on preceding tokens. c, Pre-training performance of ChemFM-1B and ChemFM-3B models, measured by perplexity (exponentiated average negative log-likelihood) on the validation set. Models are trained through $818$ billion tokens, slightly exceeding one epoch. d, Unconditional generation benchmarking for ChemFM-3B. A total of 100,000 molecules are generated randomly using a temperature setting of $1.0$. The validity, uniqueness, and novelty scores of the generated molecules are reported. Additionally, internal diversity metrics (IntDiv$_1$, IntDiv$_2$) assess the diversity of the generated molecules, while KL similarity (KLSim) evaluates how closely the distribution of generated molecules aligns with that of the training dataset.
  • Figure 2: Illustrations of fine-tuning the ChemFM model for downstream tasks. a, Property prediction fine-tuning. During fine-tuning, the SMILES strings of molecules are augmented with a probability of $1.0$ and tokenized before input to ChemFM. An MLP layer is added to the final token's hidden state in the final layer to handle single or multiple regression or classification tasks. For inference, the canonical SMILES is input into ChemFM to predict the desired properties. b, Conditional molecular generation fine-tuning. This task is also framed as a sequence-to-sequence problem. The input comprises a sequence of conditions, each initiated by a unique property identification token followed by single or multiple tokens representing the property values. Classification values are encoded as special tokens corresponding to their class indices, continuous values are normalized and mapped into the embedding space using a shared MLP, and scaffolds are represented by their SMILES and tokenized into sequences. During fine-tuning, the target molecules are augmented with a probability of $1.0$. c, Reaction prediction fine-tuning for both forward synthesis and retro-synthesis. These tasks are approached as sequence-to-sequence problems, where the model predicts the product (or reactant) sequence based on the reactant (or product) sequence. The root-aligned SMILES technique is employed, aligning both sequences using the same root atom and augmenting them by enumerating different atoms as roots.
  • Figure 3: Performance comparison on 12 MoleculeNet benchmark datasets for molecular property prediction. All methods were evaluated using the same datasets, where we employed identical splitting methods and random seeds for data splitting, ensuring that train/validation/test data are the same for each data fold. Results for ChemFM (mean and standard deviation) are reported over three runs with different dataset folds, except for BBBP, BACE, and PDBbind-full, where only a single fold is provided in the original dataset. Values for models other than ChemFM are sourced from the MMNB paper. Metrics for classification tasks included ROC-AUC or PRC-AUC, while regression tasks were evaluated using RMSE. An upward arrow ($\uparrow$) indicates that higher values are better, while a downward arrow ($\downarrow$) indicates that lower values are better. An empty bar (Chemprop method in the BACE dataset) indicates that the result was not reported in the original paper. Empty standard deviation bars occur when only a single data fold is available.
  • Figure S1.1: Comparison of chemical language model pre-training on the UniChem and ZINC20 datasets. a, c, Validation loss trajectories for models trained on the UniChem (a) and ZINC20 (c) datasets using varying model sizes. The models compared here range from approximately $10$M to $200$M parameters, excluding embeddings. b, For the UniChem dataset, the non-embedding parameters ($N$) and validation loss ($L$) closely adhere to a power-law scaling law. However, as model sizes increase to $1$B parameters (ChemFM-1B) and further to $3$B parameters (ChemFM-3B), the validation loss begins to deviate from the expected power law, suggesting that the performance gains from further increases in parameter size are approaching saturation. d, In contrast, for the ZINC20 dataset, validation loss reaches saturation when parameter size exceeds $60$M.
  • Figure S1.2: Comparison of physicochemical descriptor distributions between training and generated molecules. The descriptors were computed for 178 million molecules in the training dataset and 100,000 molecules randomly sampled from the ChemFM-3B model, using RDKit. The descriptors are: a, BertzCT, a topological index quantifying molecular complexity; b, MolLogP, the octanol-water partition coefficient; c, MolWt, molecular weight; d, TPSA, topological polar surface area; e, NumHAcceptors, number of hydrogen bond acceptors; f, NumHDonors, number of hydrogen bond donors; g, NumRotatableBonds, number of rotatable single bonds; h, NumAliphaticRings, number of aliphatic (non-aromatic) rings; i, NumAromaticRings, number of aromatic rings.
  • ...and 5 more figures
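
The Figure 1c caption defines perplexity as the exponentiated average negative log-likelihood. A minimal sketch of that definition follows. The function name `perplexity` and the toy log-probabilities are illustrative assumptions, not taken from the paper's code, and the log-probabilities are assumed to be natural logarithms.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over the predicted tokens.

    `token_log_probs` holds the natural-log probability that the model
    assigned to each ground-truth next token (hypothetical input format).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy check: a model that assigns probability 0.25 to every correct token
# is as uncertain as a uniform choice among 4 tokens.
toy_log_probs = [math.log(0.25)] * 8
print(perplexity(toy_log_probs))
```

Under this definition, lower perplexity means the model places more probability on the true next token, which is why it serves as the validation metric during pre-training.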
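
The Figure 1d caption reports validity, uniqueness, and novelty without spelling out their definitions. The sketch below uses one common convention (validity over all generated strings, uniqueness over valid ones, novelty over unique ones), which is an assumption rather than the paper's stated protocol. The `canonicalize` argument is a hypothetical callable that returns a canonical SMILES string or `None` for an invalid input; in practice it would typically wrap a cheminformatics toolkit such as RDKit, but here a toy lookup table stands in so the example runs with the standard library only.

```python
def generation_metrics(generated, training_set, canonicalize):
    """Compute validity, uniqueness, and novelty for generated SMILES.

    Convention assumed here (not confirmed by the source):
      validity   = valid / generated
      uniqueness = distinct valid / valid
      novelty    = distinct valid not in training / distinct valid
    """
    canonical = [canonicalize(s) for s in generated]
    valid = [c for c in canonical if c is not None]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Toy stand-in for a real canonicalizer: unknown strings count as invalid.
TOY_CANONICAL = {"CCO": "CCO", "OCC": "CCO", "c1ccccc1": "c1ccccc1"}

metrics = generation_metrics(
    generated=["CCO", "OCC", "c1ccccc1", "C(("],
    training_set={"CCO"},
    canonicalize=TOY_CANONICAL.get,
)
print(metrics)
```

In the toy run, "CCO" and "OCC" are two spellings of the same molecule, so they count once for uniqueness, and "C((" is rejected as invalid.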
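
The Figure S1.1 caption compares validation loss against a power law in the number of non-embedding parameters. As a rough illustration of how such a trend can be fitted and extrapolated, the sketch below assumes the simple form $L = a N^{-b}$ with no irreducible-loss term and fits it by least squares in log-log space. The functional form, the function names, and the synthetic data points are all assumptions for illustration. They are not the paper's fitting procedure or measured losses.

```python
import math


def fit_power_law(sizes, losses):
    """Fit L = a * N**(-b) via linear least squares on (log N, log L)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_loss(a, b, size):
    """Extrapolate the fitted power law to a new parameter count."""
    return a * size ** (-b)


# Synthetic losses generated from a known power law (a=5.0, b=0.1).
sizes = [1e7, 3e7, 1e8, 2e8]
losses = [5.0 * n ** (-0.1) for n in sizes]

a, b = fit_power_law(sizes, losses)
print(a, b)
print(predict_loss(a, b, 1e9))
```

Comparing the extrapolated value at a larger size with the loss actually measured there is one way to see the kind of deviation the caption describes for the 1B and 3B models.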