Many-Shot In-Context Learning for Molecular Inverse Design

Saeed Moayedpour, Alejandro Corrochano-Navarro, Faryad Sahneh, Shahriar Noroozizadeh, Alexander Koetter, Jiri Vymetal, Lorenzo Kogler-Anele, Pablo Mas, Yasser Jangjou, Sizhen Li, Michael Bailey, Marc Bianciotto, Hans Matter, Christoph Grebner, Gerhard Hessler, Ziv Bar-Joseph, Sven Jager

TL;DR

This work tackles the scarcity of experimental data in molecular design by introducing a semi-supervised many-shot in-context learning framework that iteratively augments LLM-driven molecule generation with self-generated high-performing candidates and experimental data. It combines a multi-modal, interactive design interface with task-specific evaluators trained on diverse molecular representations to robustly guide design across single- and multi-objective criteria. Empirical results demonstrate improved generation quality, greater novelty, and the ability to satisfy multiple property constraints, while highlighting that LLM-based QSAR can learn structure–activity relationships even when it does not outperform strong traditional models. The approach offers a scalable, human-in-the-loop pathway for accelerated lead optimization and property-guided molecular design in drug discovery.

Abstract

Large Language Models (LLMs) have demonstrated strong performance in few-shot In-Context Learning (ICL) for a variety of generative and discriminative chemical design tasks. The newly expanded context windows of LLMs can further improve ICL capabilities for molecular inverse design and lead optimization. To take full advantage of these capabilities, we developed a new semi-supervised learning method that overcomes the lack of experimental data available for many-shot ICL. Our approach iteratively adds LLM-generated molecules with high predicted performance to the context, alongside experimental data. We further integrated our method into a multi-modal LLM, which allows interactive modification of generated molecular structures using text instructions. As we show, the new method greatly improves upon existing ICL methods for molecular design while being accessible and easy to use for scientists.
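
The iterative loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`build_prompt`, `iterative_many_shot`, `toy_generate`, `toy_predict`), the prompt format, and the top-k selection rule are all assumptions made for illustration. The toy generator and predictor stand in for a real LLM call and a trained activity model.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, float]  # (SMILES string, activity value)


def build_prompt(examples: List[Example]) -> str:
    """Format (SMILES, activity) pairs as many-shot context lines (assumed format)."""
    lines = [f"SMILES: {smi} | activity: {act:.2f}" for smi, act in examples]
    return "\n".join(lines) + "\nPropose new molecules with higher activity:"


def iterative_many_shot(
    experimental: List[Example],
    generate: Callable[[str, int], List[str]],
    predict: Callable[[str], float],
    n_iterations: int = 3,
    n_generate: int = 20,
    top_k: int = 5,
) -> List[Example]:
    """Grow the context with generated molecules that score highest under `predict`."""
    context = list(experimental)
    seen = {smi for smi, _ in context}
    for _ in range(n_iterations):
        prompt = build_prompt(context)
        # Keep only molecules not already in the context, dropping repeats.
        candidates = list(dict.fromkeys(
            smi for smi in generate(prompt, n_generate) if smi not in seen
        ))
        # Label candidates with predicted (not experimental) activities.
        scored = sorted(
            ((smi, predict(smi)) for smi in candidates),
            key=lambda pair: pair[1],
            reverse=True,
        )
        best = scored[:top_k]
        context.extend(best)
        seen.update(smi for smi, _ in best)
    return context


def toy_generate(prompt: str, n: int) -> List[str]:
    """Hypothetical stand-in for an LLM call: appends a carbon to SMILES in the prompt."""
    prefix = "SMILES: "
    seeds = [
        line[len(prefix):].split(" | ")[0]
        for line in prompt.splitlines()
        if line.startswith(prefix)
    ]
    return [seeds[i % len(seeds)] + "C" for i in range(n)]


def toy_predict(smiles: str) -> float:
    """Hypothetical stand-in for a trained activity model: scores by string length."""
    return float(len(smiles))


experimental_data = [("CCO", 3.0), ("CCN", 3.0)]
result = iterative_many_shot(
    experimental_data, toy_generate, toy_predict,
    n_iterations=2, n_generate=4, top_k=2,
)
```

In a real setting, `generate` would wrap a long-context LLM and `predict` a property evaluator. The sketch only shows how self-generated, high-scoring molecules are fed back into the context alongside the experimental examples.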

Paper Structure (19 sections, 7 equations, 10 figures, 1 table)

Figures (10)

  • Figure 1: A) Distribution of molecular activities against the MMP8 protein target in the lead molecules dataset (Lead) and in the subsets of the training dataset that include the top 5 to 500 most active molecules. B) Distribution of activities in lead molecules and predicted activities of molecules generated with 5 to 500 shots. C) Fréchet ChemNet Distance (FCD) between generated molecules and the lead molecules in 5- to 500-shot ICL experiments. D) Distribution of activities in lead molecules and predicted activities of generated molecules across iterations in which self-generated molecules and their predicted activities are included in the context.
  • Figure 2: Distribution of (a) molecular weight, (b) SA score, (c) logP, and (d) tPSA in the training dataset (Train), the lead dataset (Lead), 500-shot ICL with no property criteria other than activity (No condition), 500-shot ICL with a specified property condition but no property labels in the context (Condition w/o labels), and 500-shot ICL with a specified property condition and property labels provided in the context (Condition w/ labels).
  • Figure 3: Example lead molecules and the resulting generated molecules along with their predicted properties. The improved properties are highlighted.
  • Figure 4: Scatter plots of predicted versus experimental activities on validation data from cross-validation folds for the MMP8 protein target. (A) shows the activities predicted by the LLM, while (B), (C), and (D) show the results of CatBoost regression models trained on different input features: circular fingerprints with a radius of 3 and a 2048-bit vector size, RDKit descriptors, and Mol2Vec features, respectively.
  • Figure 5: Overview of the iterative design process. Our tool aims to address the challenge of SMILES modification, which requires extensive understanding of structural chemistry and SMILES notation.
  • ...and 5 more figures