GeLLM^3O: Generalizing Large Language Models for Multi-property Molecule Optimization

Vishal Dey, Xiao Hu, Xia Ning

TL;DR

This paper introduces MuMOInstruct, the first large-scale instruction-tuning dataset designed for multi-property molecule optimization, and uses it to train both task-specific and generalist GeLLM^3O models that learn property trade-offs across diverse contexts. In zero-shot evaluation on 5 in-domain (IND) and 5 out-of-domain (OOD) tasks, these models outperform general-purpose LLMs, chemistry-focused baselines, and task-specific non-LLM methods, and the generalist variants generalize to unseen tasks and instructions. Because the models are LoRA-fine-tuned LLMs trained on MuMOInstruct, they handle new tasks without task-specific retraining, which points to the potential of GeLLM^3O as foundational models for molecule optimization in drug discovery. The models explore multi-property trade-offs while maintaining scaffold similarity, and the data, models, and code are openly available.

Abstract

Despite recent advancements, most computational methods for molecule optimization are constrained to single- or double-property optimization tasks and suffer from poor scalability and generalizability to novel optimization tasks. Meanwhile, Large Language Models (LLMs) demonstrate remarkable out-of-domain generalizability to novel tasks. To demonstrate LLMs' potential for molecule optimization, we introduce MuMOInstruct, the first high-quality instruction-tuning dataset specifically focused on complex multi-property molecule optimization tasks. Leveraging MuMOInstruct, we develop GeLLM^3Os, a series of instruction-tuned LLMs for molecule optimization. Extensive evaluations across 5 in-domain and 5 out-of-domain tasks demonstrate that GeLLM^3Os consistently outperform state-of-the-art baselines. GeLLM^3Os also exhibit outstanding zero-shot generalization to unseen tasks, significantly outperforming powerful closed-source LLMs. Such strong generalizability demonstrates the tremendous potential of GeLLM^3Os as foundational models for molecule optimization, thereby tackling novel optimization tasks without resource-intensive retraining. MuMOInstruct, models, and code are accessible through https://github.com/ninglab/GeLLMO.
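
To make the idea of multi-property optimization concrete, the following is a minimal sketch of one way to judge whether an optimized molecule counts as a success: every targeted property must move in its desired direction while the molecule stays sufficiently similar to the input. This is an illustrative assumption, not the paper's evaluation protocol. The function name, the placeholder property names, and the 0.6 similarity threshold are all hypothetical, and the property and similarity values are assumed to have been computed elsewhere.

```python
from typing import Dict


def is_successful_optimization(
    before: Dict[str, float],
    after: Dict[str, float],
    higher_is_better: Dict[str, bool],
    similarity: float,
    min_similarity: float = 0.6,  # hypothetical threshold, not taken from the paper
) -> bool:
    """Return True if all targeted properties improved and similarity is preserved.

    before / after: property values of the input and the optimized molecule.
    higher_is_better: desired direction for each targeted property.
    similarity: precomputed structural similarity between the two molecules.
    """
    # Reject candidates that drift too far from the input molecule.
    if similarity < min_similarity:
        return False

    # Every targeted property must strictly improve in its desired direction.
    for name, want_higher in higher_is_better.items():
        delta = after[name] - before[name]
        if want_higher and delta <= 0:
            return False
        if not want_higher and delta >= 0:
            return False
    return True


# Placeholder property names and values, for illustration only.
directions = {"prop_a": True, "prop_b": False, "prop_c": True}
before = {"prop_a": 0.30, "prop_b": 0.55, "prop_c": 0.40}
after = {"prop_a": 0.45, "prop_b": 0.35, "prop_c": 0.50}

print(is_successful_optimization(before, after, directions, similarity=0.7))
```

Requiring a strict improvement on every property is a deliberately strict choice; a softer variant could accept unchanged values or apply per-property thresholds instead.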

Paper Structure

This paper contains 59 sections, 2 equations, 9 figures, 24 tables.

Figures (9)

  • Figure 1: Overview of MuMOInstruct and GeLLM^3O
  • Figure 2: An optimization case on BHMQ. Modifications are highlighted in red.
  • Figure A1: Prompt template used for instruction-tuning GeLLM^3Os
  • Figure A2: An example of a prompt used for general-purpose LLMs
  • Figure A3: An example of a prompt used for LlaSMol
  • ...and 4 more figures