GP-MoLFormer: A Foundation Model For Molecular Generation

Jerret Ross, Brian Belgodere, Samuel C. Hoffman, Vijil Chenthamarakshan, Jiri Navratil, Youssef Mroueh, Payel Das

TL;DR

GP-MoLFormer is a decoder-only transformer trained on up to 1.1B SMILES strings, enabling unconditional de novo generation and targeted molecular design. It introduces pair-tuning, a parameter-efficient soft-prompt approach that learns from property-ordered molecule pairs to convert a seed molecule into candidates with improved properties, without updating the base model. The method achieves competitive or superior performance across de novo generation, scaffold decoration, and three property optimization tasks (penalized logP, QED, DRD2 activity), while revealing memorization and data-bias effects at scale. The study also establishes a scaling law relating inference compute to novelty and discusses implications for training-data quality and evaluation in large-scale chemical language models.

Abstract

Transformer-based models trained on large and general purpose datasets consisting of molecular strings have recently emerged as a powerful tool for successfully modeling various structure-property relations. Inspired by this success, we extend the paradigm of training chemical language transformers on large-scale chemical datasets to generative tasks in this work. Specifically, we propose GP-MoLFormer, an autoregressive molecular string generator that is trained on more than 1.1B (billion) chemical SMILES. GP-MoLFormer uses a 46.8M parameter transformer decoder model with linear attention and rotary positional encodings as the base architecture. GP-MoLFormer's utility is evaluated and compared with that of existing baselines on three different tasks: de novo generation, scaffold-constrained molecular decoration, and unconstrained property-guided optimization. While the first two are handled with no additional training, we propose a parameter-efficient fine-tuning method for the last task, which uses property-ordered molecular pairs as input. We call this new approach pair-tuning. Our results show GP-MoLFormer performs better than or comparably to baselines across all three tasks, demonstrating its general utility for a variety of molecular generation tasks. We further report strong memorization of training data in GP-MoLFormer generations, which has so far remained unexplored for chemical language models. Our analyses reveal that training data memorization and novelty in generations are impacted by the quality and scale of the training data; duplication bias in training data can enhance memorization at the cost of lowering novelty. We further establish a scaling law relating inference compute and novelty in generations.

Paper Structure

This paper contains 7 sections, 3 equations, 9 figures, 10 tables, 1 algorithm.

Figures (9)

  • Figure 1: GP-MoLFormer, a generative pre-trained molecular foundation model. (A) Unconditional generation using GP-MoLFormer. SMILES representations are generated autoregressively and randomly along the learned manifold (purple area). (B) During pair-tuning, a prompt vector is learned, which translates a given molecular representation (light blue dots) to an optimized region of the manifold (red diamonds).
  • Figure 2: Property distributions of different test datasets, namely MOSES, ZINC-15 (MolGen-7b), and GP-MoLFormer-Uniq (ours), along with generated samples from GP-MoLFormer-Uniq and GP-MoLFormer-Druglike. Clockwise from top left: octanol-water partition coefficient, drug-likeness, synthetic accessibility, molecular weight. Our test distributions are consistently wider (more diverse) than those of the other baselines. Furthermore, the generated distribution matches the corresponding test distribution almost exactly. Compared with GP-MoLFormer-Uniq, GP-MoLFormer-Druglike shows a density shift toward higher QED values, as expected.
  • Figure S1: The novelty of the scaffold of each generated molecule with GP-MoLFormer-Uniq compared to the most similar scaffold in the training set.
  • Figure S2: Effect of temperature during multinomial sampling of 100k generated molecules on the percentage of novel molecules with respect to the full 1.1B training set (blue), novelty with respect to a random 130M subset of the training set (orange), validity (green), and uniqueness (red) of generations from GP-MoLFormer.
  • Figure S3: A sample of molecules generated de novo.
  • ...and 4 more figures
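
The caption of Figure S2 refers to temperature-controlled multinomial sampling and to the novelty and uniqueness of the generated molecules. The following is a minimal sketch of those ideas, not the authors' implementation. It assumes that molecules are compared as already-canonicalized SMILES strings, that uniqueness is the fraction of distinct generations, and that novelty is the fraction of distinct generations absent from a reference set; the paper's exact definitions may differ. Validity is omitted because it needs a cheminformatics toolkit. All function names are hypothetical.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Draw one token index from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def uniqueness(generated):
    """Fraction of generated strings that are distinct (assumed definition)."""
    return len(set(generated)) / len(generated) if generated else 0.0


def novelty(generated, reference):
    """Fraction of distinct generated strings not found in the reference set
    (assumed definition; strings are assumed to be canonical SMILES)."""
    unique = set(generated)
    if not unique:
        return 0.0
    return len(unique - set(reference)) / len(unique)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy next-token logits; a lower temperature concentrates mass on index 1.
    logits = [0.0, 2.0, 1.0]
    for t in (0.5, 1.0, 2.0):
        draws = [sample_with_temperature(logits, t, rng) for _ in range(1000)]
        print(t, draws.count(1) / len(draws))

    generated = ["CCO", "CCO", "c1ccccc1", "CCN"]
    reference = {"CCO", "CCC"}
    print(uniqueness(generated), novelty(generated, reference))
```

In this sketch, raising the temperature flattens the sampling distribution, which is the knob varied in Figure S2. Measuring novelty against a smaller reference set (such as a random subset of the training data) can only leave it unchanged or increase it, since fewer generated strings can be matched.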