Small Molecule Optimization with Large Language Models
Philipp Guevorguian, Menua Bedrosian, Tigran Fahradyan, Gayane Chilingaryan, Hrant Khachatrian, Armen Aghajanyan
TL;DR
This work shows that large language models can drive efficient, property-targeted molecular design when trained on a rich PubChem-derived corpus and combined with an LLM-driven optimization loop. The authors introduce Chemlactica and Chemma, two LLMs fine-tuned on 110M molecules and their computed properties, and pair them with a genetic-algorithm-like optimization framework that uses rejection sampling and prompt optimization to navigate chemical space. Across Practical Molecular Optimization (PMO), docking-based multi-property optimization, and QED-similarity constrained design, the approach achieves state-of-the-art performance, including an 8% improvement on PMO and strong results under strict oracle budgets. The work provides a scalable path for rapid task adaptation with a few hundred fine-tuning examples and releases the training corpus, models, and optimization recipes for reproducibility and broader use in drug discovery. The findings highlight both the practical potential and the limitations of SMILES-based LLMs in guided molecular design, and the authors also discuss broader societal and safety implications.
Abstract
Recent advancements in large language models have opened new possibilities for generative molecular drug design. We present Chemlactica and Chemma, two language models fine-tuned on a novel corpus of 110M molecules with computed properties, totaling 40B tokens. These models demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. We introduce a novel optimization algorithm that leverages our language models to optimize molecules for arbitrary properties given limited access to a black box oracle. Our approach combines ideas from genetic algorithms, rejection sampling, and prompt optimization. It achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement on Practical Molecular Optimization compared to previous methods. We publicly release the training corpus, the language models and the optimization algorithm.
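To make the shape of such an optimization loop concrete, the following is a minimal toy sketch, not the authors' algorithm. It assumes a pool of the best candidates seen so far (the genetic-algorithm idea), a filter that discards duplicate proposals before they cost an oracle call (a simple form of rejection sampling), and a prompt rebuilt from the current pool at every round (a stand-in for prompt optimization). All names here (`BudgetedOracle`, `toy_generator`, `optimize`) are hypothetical, the "molecules" are plain strings rather than SMILES, and the generator is a random mutation function standing in for a language model.

```python
import random


class BudgetedOracle:
    """Wraps a black-box scoring function and counts calls against a fixed budget."""

    def __init__(self, score_fn, budget):
        self.score_fn = score_fn
        self.budget = budget
        self.calls = 0
        self.cache = {}

    def exhausted(self):
        return self.calls >= self.budget

    def __call__(self, candidate):
        if candidate in self.cache:  # repeated queries are free
            return self.cache[candidate]
        if self.exhausted():
            raise RuntimeError("oracle budget exhausted")
        self.calls += 1
        self.cache[candidate] = self.score_fn(candidate)
        return self.cache[candidate]


def toy_generator(prompt_examples, rng, alphabet="CNO"):
    """Hypothetical stand-in for the language model: picks one example
    from the prompt and changes a single character."""
    parent = rng.choice(prompt_examples)
    pos = rng.randrange(len(parent))
    return parent[:pos] + rng.choice(alphabet) + parent[pos + 1:]


def optimize(oracle, seed_pool, pool_size=5, batch_size=8, seed=0):
    """Pool-based search under an oracle budget (illustrative assumption only)."""
    rng = random.Random(seed)
    pool = sorted(((oracle(m), m) for m in seed_pool), reverse=True)[:pool_size]
    seen = set(seed_pool)
    while not oracle.exhausted():
        # "Prompt": the current best candidates, rebuilt every round.
        prompt_examples = [m for _, m in pool]
        batch, attempts = [], 0
        while len(batch) < batch_size and attempts < 10 * batch_size:
            attempts += 1
            candidate = toy_generator(prompt_examples, rng)
            if candidate in seen:  # reject duplicates before scoring
                continue
            seen.add(candidate)
            batch.append(candidate)
        if not batch:  # nothing new could be proposed
            break
        for candidate in batch:
            if oracle.exhausted():
                break
            pool.append((oracle(candidate), candidate))
        # Keep only the highest-scoring candidates for the next round.
        pool = sorted(pool, reverse=True)[:pool_size]
    return pool


if __name__ == "__main__":
    # Toy objective: count of "N" characters in the string.
    oracle = BudgetedOracle(lambda s: s.count("N"), budget=50)
    for score, candidate in optimize(oracle, ["CCCCCC"]):
        print(score, candidate)
```

In a real setting the generator would be a trained model conditioned on the desired properties, the filter would also check chemical validity, and the oracle would be an expensive evaluation such as a docking score; the budget accounting is what makes results comparable under a fixed number of oracle calls.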
