Small Molecule Optimization with Large Language Models

Philipp Guevorguian, Menua Bedrosian, Tigran Fahradyan, Gayane Chilingaryan, Hrant Khachatrian, Armen Aghajanyan

TL;DR

This work shows that large language models can drive efficient, property-targeted molecular design when trained on a rich PubChem-derived corpus and combined with an LLM-enabled optimization loop. The authors introduce Chemlactica and Chemma, two LLMs fine-tuned on 110M molecules and their computed properties, and pair them with a genetic-algorithm–like optimization framework that uses rejection sampling and prompt optimization to navigate chemical space. Across Practical Molecular Optimization, docking-based multi-property optimization, and QED-similarity constrained design, the approach achieves state-of-the-art performance, including an 8% improvement on PMO and strong results under strict oracle budgets. The work provides a scalable path for rapid task adaptation with a few hundred fine-tuning examples and releases the training corpus, models, and optimization recipes for reproducibility and broader use in drug discovery. The findings highlight the practical potential and limitations of SMILES-based LLMs in guided molecular design, with broader societal considerations and safety implications discussed.

Abstract

Recent advancements in large language models have opened new possibilities for generative molecular drug design. We present Chemlactica and Chemma, two language models fine-tuned on a novel corpus of 110M molecules with computed properties, totaling 40B tokens. These models demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. We introduce a novel optimization algorithm that leverages our language models to optimize molecules for arbitrary properties given limited access to a black box oracle. Our approach combines ideas from genetic algorithms, rejection sampling, and prompt optimization. It achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement on Practical Molecular Optimization compared to previous methods. We publicly release the training corpus, the language models and the optimization algorithm.
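As a rough illustration of how a pool-based loop, a rejection step, and a limited oracle budget can fit together, here is a minimal sketch. It is not the authors' algorithm. The names `BudgetedOracle`, `toy_score`, `propose`, and `optimize` are hypothetical, the "molecules" are fixed-length toy strings rather than SMILES, and `propose` is a simple splice-and-mutate stand-in for the step where a language model would be prompted with molecules from the pool.

```python
import random

ALPHABET = "CNOF"  # toy token set (assumption); no real SMILES handling


class BudgetedOracle:
    """Wraps a black-box scoring function and limits unique evaluations."""

    def __init__(self, score_fn, budget):
        self.score_fn = score_fn
        self.budget = budget
        self.cache = {}  # candidate -> score; repeated queries are free

    def exhausted(self):
        return len(self.cache) >= self.budget

    def __call__(self, candidate):
        if candidate not in self.cache:
            if self.exhausted():
                raise RuntimeError("oracle budget exhausted")
            self.cache[candidate] = self.score_fn(candidate)
        return self.cache[candidate]


def toy_score(s):
    # Hypothetical objective: fraction of 'N' tokens in the string.
    return s.count("N") / len(s)


def propose(pool, rng):
    # Stand-in for the language model: splice two pool members of equal
    # length, then mutate one position.
    a, b = rng.choice(pool), rng.choice(pool)
    cut = rng.randrange(1, len(a))
    child = list(a[:cut] + b[cut:])
    child[rng.randrange(len(child))] = rng.choice(ALPHABET)
    return "".join(child)


def optimize(oracle, seeds, pool_size=5, batch=8, max_rounds=1000, rng_seed=0):
    rng = random.Random(rng_seed)
    scored = [(oracle(s), s) for s in seeds]
    for _ in range(max_rounds):
        if oracle.exhausted():
            break
        # Genetic-algorithm-style selection: keep the best candidates so far.
        pool = [s for _, s in sorted(scored, reverse=True)[:pool_size]]
        for _ in range(batch):
            cand = propose(pool, rng)
            # Rejection step: discard already-scored candidates before
            # spending any oracle budget on them.
            if cand in oracle.cache:
                continue
            if oracle.exhausted():
                break
            scored.append((oracle(cand), cand))
    return max(scored)


seeds = ["CCCCCCCC", "COCOCOCO", "FFFFCCCC"]
oracle = BudgetedOracle(toy_score, budget=60)
best_score, best = optimize(oracle, seeds)
print(best, best_score, len(oracle.cache))
```

In this sketch, only unique candidates count against the budget, so filtering duplicates before scoring is what keeps oracle calls from being wasted. A realistic rejection step would more likely filter on validity or similarity constraints, which are not modeled here.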

Paper Structure (57 sections, 1 equation, 11 figures, 14 tables, 2 algorithms)

Figures (11)

  • Figure 1: Model calibration on synthetic multiple-choice questions, where the line y=x represents perfect calibration.
  • Figure 2: Illustration of errors made by Chemma-2B during property prediction and conditional generation for various properties.
  • Figure 3: Optimization process visualization using Chemlactica-125M model for sitagliptin_mpo task with four different seeds.
  • Figure 4: Optimization process visualization using Chemlactica-1.3B model for sitagliptin_mpo task with four different seeds.
  • Figure 5: Optimization process visualization using Chemma-2B model for sitagliptin_mpo task with four different seeds.
  • ...and 6 more figures