Small Molecule Optimization with Large Language Models

Philipp Guevorguian; Menua Bedrosian; Tigran Fahradyan; Gayane Chilingaryan; Hrant Khachatrian; Armen Aghajanyan

TL;DR

This work shows that large language models can drive efficient, property-targeted molecular design when trained on a PubChem-derived corpus and embedded in an optimization loop. The authors introduce Chemlactica and Chemma, two LLMs fine-tuned on 110M molecules and their computed properties, and pair them with a genetic-algorithm-like optimization framework that uses rejection sampling and prompt optimization to navigate chemical space. Across Practical Molecular Optimization (PMO), docking-based multi-property optimization, and QED-similarity constrained design, the approach achieves state-of-the-art performance, including an 8% improvement on PMO and strong results under strict oracle budgets. The models can be adapted to new tasks with a few hundred fine-tuning examples, and the authors release the training corpus, models, and optimization recipes for reproducibility and broader use in drug discovery. The paper also discusses the practical potential and limitations of SMILES-based LLMs in guided molecular design, along with broader societal and safety implications.

Abstract

Recent advancements in large language models have opened new possibilities for generative molecular drug design. We present Chemlactica and Chemma, two language models fine-tuned on a novel corpus of 110M molecules with computed properties, totaling 40B tokens. These models demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. We introduce a novel optimization algorithm that leverages our language models to optimize molecules for arbitrary properties given limited access to a black box oracle. Our approach combines ideas from genetic algorithms, rejection sampling, and prompt optimization. It achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement on Practical Molecular Optimization compared to previous methods. We publicly release the training corpus, the language models and the optimization algorithm.

Paper Structure (57 sections, 1 equation, 11 figures, 14 tables, 2 algorithms)

Introduction
Related Work
Language Models for Molecular Representation
Molecular Optimization Techniques
Recurrent Neural Networks in Molecular Design
Large Language Models in Optimization
Training Corpus
Molecular Database from PubChem
JSONL Corpus Generation
Text Generation Template
Model Training and Evaluation
Selection of Pretrained Language Models
Tokenization and Sample Preparation
Training Methodology
Evaluation of Computed Property Prediction and Conditional Generation
...and 42 more sections

Figures (11)

Figure 1: Model calibration on synthetic multiple-choice questions, where the line y=x represents perfect calibration.
Figure 2: Illustration of errors made by Chemma-2B during property prediction and conditional generation for various properties.
Figure 3: Optimization process visualization using the Chemlactica-125M model for the sitagliptin_mpo task with four different seeds.
Figure 4: Optimization process visualization using the Chemlactica-1.3B model for the sitagliptin_mpo task with four different seeds.
Figure 5: Optimization process visualization using the Chemma-2B model for the sitagliptin_mpo task with four different seeds.
...and 6 more figures
