SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration

Joseph M. Cavanagh, Kunyang Sun, Andrew Gritsevskiy, Dorian Bagni, Yingze Wang, Thomas D. Bannister, Teresa Head-Gordon

TL;DR

This work shows that an open-weight LLM foundation model can be repurposed as a chemical language model (CLM) through supervised fine-tuning with engineered prompts and, optionally, Direct Preference Optimization, enabling directed molecule generation without training a CLM from scratch. By training on ChEMBL SMILES and integrating with the iMiner reinforcement learning framework, SmileyLlama can generate valid, drug-like molecules and optimize them for 3D binding against SARS-CoV-2 MPro. The results demonstrate that SFT, together with prompt engineering and DPO, achieves competitive performance on GuacaMol benchmarks and improves task adherence while maintaining diversity; the approach also supports efficient target-specific design with reduced computational burden. The framework generalizes beyond drug discovery to other chemical domains and highlights practical implications for rapid, guided chemical space exploration using LLMs.

Abstract

Here we show that a general-purpose large language model (LLM) chatbot, Llama-3.1-8B-Instruct, can be transformed via supervised fine-tuning on engineered prompts into a chemical language model (CLM), SmileyLlama, for molecule generation. We benchmark SmileyLlama by comparing it to CLMs trained from scratch on large amounts of ChEMBL data for their ability to generate valid and novel drug-like molecules. We also use direct preference optimization both to improve SmileyLlama's adherence to a prompt and to generate molecules within the iMiner reinforcement learning framework, predicting new drug molecules with optimized 3D conformations and high binding affinity to drug targets, illustrated with the SARS-CoV-2 Main Protease. This overall framework allows an LLM to speak directly as a CLM that can generate molecules with user-specified properties, rather than acting only as a chatbot with knowledge of chemistry or as a helpful virtual assistant. While our dataset and analyses are geared toward drug discovery, this general procedure can be extended to other chemical applications such as chemical synthesis.

Paper Structure (9 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: A visualization of the SFT workflow for SmileyLlama. Starting from the Llama-3.1-8B-Instruct model [dubey_llama_2024], we used prompt-response pairs consisting of calculated molecular properties and completed SMILES strings to fine-tune Llama on SMILES string completions, yielding SmileyLlama. Crucially, we construct the prompt for each example using properties calculated from the correct response (a SMILES string from ChEMBL v33).
  • Figure 2: Comparison of property distributions for molecules generated by SmileyLlama (blue) and molecules from the ChEMBL training dataset (gold). (a) UMAP visualization of a random selection of 10,000 ChEMBL molecules and 10,000 SmileyLlama-generated molecules, using 15 neighbors and a minimum distance of 0.1; these are typical values in chemical space visualization [orlovHighDimensionsHuman2025]. (b) The molecular properties considered are the fraction of $sp^3$-hybridized carbons and heteroatoms, number of heavy atoms, number of H-bond donors and acceptors, number of aliphatic and aromatic rings and the maximum ring size, number of rotatable bonds, quantitative estimate of drug-likeness (QED) value [bickerton2012quantifying], molecular weight (MW), approximate log partition coefficient between octanol and water (ALOGP) [wildman1999prediction], polar surface area (PSA) and topological PSA [Prasanna2009], and the number of structural alerts [brenk2008lessons]. All benchmarks were run at a temperature $T=1.0$ and a maximum of 256 new tokens.
  • Figure 3: Conditional generation with SmileyLlama: fragment growth, and property distributions before and after DPO compared to ChEMBL. (a) Example molecules generated by growing from one of the Enamine substructures while satisfying Lipinski's Rule of 5, using the prompt "Output a SMILES string for a drug like molecule with the following properties: a substructure of O=C(O)c1ccc(C(F)(F)F)cc1, <= 500 MW, <=5 logP, <= 5 H-bond donors, <= 10 H-bond acceptors". (b) Distribution of the four properties in Lipinski's Rule of 5, comparing ChEMBL molecules (orange) with molecules generated by SmileyLlama (blue) using the prompt "Output a SMILES string for a drug like molecule with the following properties: <= 5 H-bond donors, <= 10 H-bond acceptors, <= 500 MW, <= 5 logP", and with 1000 molecules generated by SmileyLlama with the same prompt after DPO (gray). MW and logP distributions were estimated using a Gaussian kernel density estimator (KDE) [KernelDensityEstimators1992]. All results used 1000 generated molecules at a temperature $T=1.0$ and a maximum of 128 new tokens.
  • Figure 4: Comparison of the shift in docking score distributions for iMiner and SmileyLlama over optimization epochs, illustrated for SARS2 MPro. (a) For iMiner, diversity collapses in later epochs, which explains the sharpening peaks in later iterations. SmileyLlama with DPO (SL+DPO) enforces diversity throughout the optimization (Algorithm S3), which accounts for the broad peaks, and shows superior data efficiency relative to iMiner. (b) We compare two different user prompts: Sars2Pro and Sars2Pro+Ro5. All results were generated with 2000 valid SMILES at a temperature of $T=1.0$ and a maximum of 128 new tokens.
  • Figure 5: SmileyLlama de novo generated molecules in the active site of the SARS2 main protease. Surface rendering of the SmileyLlama-generated molecules in the SARS2 MPro canonical binding pocket. The molecules were generated by SmileyLlama after optimization with (a) the Sars2Pro prompt and (b, c) the Sars2Pro+Ro5 prompt. For some of the highest-scoring ligands, Supplementary Table S2 provides SMILES strings and docking scores, and Supplementary Figure S3 shows docking poses.
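
The Figure 1 caption says each training prompt is built from properties calculated on the target SMILES string. A minimal sketch of that idea is given here, and it is an illustration under stated assumptions rather than the authors' pipeline. The prompt wording is taken from the Figure 3 caption. The `MolProps` class, the `RO5_HINTS` table, the helper names, and the hand-entered, approximate property values for aspirin are all hypothetical. In practice the properties would be computed with a cheminformatics toolkit.

```python
from dataclasses import dataclass

# Prompt prefix as quoted in the Figure 3 caption.
PROMPT_PREFIX = (
    "Output a SMILES string for a drug like molecule "
    "with the following properties: "
)


@dataclass
class MolProps:
    """Hypothetical container for precomputed molecular properties."""
    h_donors: int
    h_acceptors: int
    mol_weight: float
    logp: float


# (attribute, upper limit, label) for the Lipinski Rule of 5 hints,
# in the order used by the Figure 3(b) prompt.
RO5_HINTS = [
    ("h_donors", 5, "H-bond donors"),
    ("h_acceptors", 10, "H-bond acceptors"),
    ("mol_weight", 500, "MW"),
    ("logp", 5, "logP"),
]


def satisfied_hints(props: MolProps) -> list[str]:
    """Return only the hints that are true for this molecule, so the
    prompt never claims a property the response does not have."""
    return [
        f"<= {limit} {label}"
        for attr, limit, label in RO5_HINTS
        if getattr(props, attr) <= limit
    ]


def build_prompt(props: MolProps) -> str:
    """Join the satisfied hints into a single prompt string.
    Assumption: every satisfied hint is included; no hints yields a bare prefix."""
    return PROMPT_PREFIX + ", ".join(satisfied_hints(props))


def sft_pair(smiles: str, props: MolProps) -> dict[str, str]:
    """Build one prompt-response pair for supervised fine-tuning."""
    return {"prompt": build_prompt(props), "response": smiles}


# Aspirin, with approximate hand-entered values (assumed, not computed here).
aspirin = MolProps(h_donors=1, h_acceptors=3, mol_weight=180.16, logp=1.3)
pair = sft_pair("CC(=O)Oc1ccccc1C(=O)O", aspirin)
print(pair["prompt"])
print(pair["response"])
```

Because the hints are derived from the response itself, every training pair is consistent by construction. The same `satisfied_hints` check could also be reused after generation to measure how often sampled molecules adhere to the prompt.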