Property Enhanced Instruction Tuning for Multi-task Molecule Generation with Large Language Models

Xuan Lin, Long Chen, Yile Wang, Xiangxiang Zeng, Philip S. Yu

TL;DR

This work enables LLMs to perform multi-task molecule generation under multiple property constraints by introducing PEIT, a two-step framework. The first step builds a multimodal pre-trained generator, PEIT-GEN, that aligns textual descriptions, SMILES strings, and biochemical properties; the second step instruction-tunes open-source LLMs into PEIT-LLM. PEIT-GEN learns cross-modal representations via matching, contrastive learning, and cross-modal language modeling, and is then used to generate diverse instruction data. PEIT-LLM is fine-tuned on this template-filled data to support molecule captioning, text-based molecule generation, property prediction, and multi-constraint generation. In the experiments, PEIT-GEN outperforms MolT5 and BioT5 in molecule captioning, and PEIT-LLM surpasses several baselines on captioning, generation, property prediction, and multi-constraint generation, with strong generalization in out-of-distribution settings. The framework demonstrates scalable, data-efficient, multi-task molecular reasoning for open LLMs, and the authors provide code and data to foster further research.
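The contrastive objective mentioned above can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE-style loss over cosine similarities, which is a common choice for cross-modal alignment; the exact loss used by PEIT-GEN is not given in this summary, and the function names and the temperature value of 0.07 are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to positives[i],
    using every other item in the batch as a negative.
    The temperature is an assumed value, not taken from the paper."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def symmetric_contrastive_loss(modality_a, modality_b, temperature=0.07):
    """Average of the a-to-b and b-to-a directions."""
    return 0.5 * (
        info_nce(modality_a, modality_b, temperature)
        + info_nce(modality_b, modality_a, temperature)
    )


if __name__ == "__main__":
    # Toy embeddings standing in for SMILES and text representations.
    smiles_embeddings = [[1.0, 0.0], [0.0, 1.0]]
    text_embeddings = [[0.9, 0.1], [0.1, 0.9]]
    print(symmetric_contrastive_loss(smiles_embeddings, text_embeddings))
```

In this sketch the loss is small when paired embeddings from the two modalities (for example SMILES and text, or SMILES and properties) point in the same direction, and large when each embedding is closer to another item's partner.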

Abstract

Large language models (LLMs) are widely applied in various natural language processing tasks such as question answering and machine translation. However, due to the lack of labeled data and the difficulty of manual annotation for biochemical properties, their performance on molecule generation tasks is still limited, especially for tasks involving multi-property constraints. In this work, we present a two-step framework, PEIT (Property Enhanced Instruction Tuning), to improve LLMs for molecule-related tasks. In the first step, we use textual descriptions, SMILES, and biochemical properties as multimodal inputs to pre-train a model called PEIT-GEN, which aligns the multimodal representations and is then used to synthesize instruction data. In the second step, we fine-tune existing open-source LLMs with the synthesized data; the resulting PEIT-LLM can handle molecule captioning, text-based molecule generation, molecular property prediction, and our newly proposed multi-constraint molecule generation tasks. Experimental results show that our pre-trained PEIT-GEN outperforms MolT5 and BioT5 in molecule captioning, demonstrating that the modalities of textual descriptions, structures, and biochemical properties align well. Furthermore, PEIT-LLM shows promising improvements in multi-task molecule generation, demonstrating the scalability of the PEIT framework for various molecular tasks. We release the code, constructed instruction data, and model checkpoints at https://github.com/chenlong164/PEIT.
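The second step relies on turning (description, SMILES, property) triples into instruction data via template filling. The following is a minimal sketch of that idea only; the template wording, the `MoleculeRecord` type, the `build_instructions` helper, and the property values are all hypothetical and are not taken from the released PEIT code or data.

```python
from dataclasses import dataclass


@dataclass
class MoleculeRecord:
    smiles: str
    description: str
    properties: dict[str, float]


# Hypothetical (instruction, output) templates, one per task named in the abstract.
TEMPLATES = {
    "captioning": ("Describe the molecule with SMILES {smiles}.", "{description}"),
    "text_based_generation": (
        "Generate a molecule that matches this description: {description}",
        "{smiles}",
    ),
    "property_prediction": ("Predict the {name} of the molecule {smiles}.", "{value:.2f}"),
    "multi_constraint_generation": (
        "Generate a molecule that satisfies: {constraints}.",
        "{smiles}",
    ),
}


def build_instructions(record: MoleculeRecord) -> list[dict[str, str]]:
    """Fill every template with the fields of one record."""
    samples = []

    def add(task, **fields):
        prompt, answer = TEMPLATES[task]
        samples.append(
            {
                "task": task,
                "instruction": prompt.format(**fields),
                "output": answer.format(**fields),
            }
        )

    add("captioning", smiles=record.smiles, description=record.description)
    add("text_based_generation", smiles=record.smiles, description=record.description)
    ordered = sorted(record.properties.items())
    for name, value in ordered:
        add("property_prediction", smiles=record.smiles, name=name, value=value)
    constraints = ", ".join(f"{name} = {value:.2f}" for name, value in ordered)
    add("multi_constraint_generation", smiles=record.smiles, constraints=constraints)
    return samples


if __name__ == "__main__":
    # Property values below are illustrative placeholders, not computed or measured.
    record = MoleculeRecord(
        smiles="CCO",
        description="Ethanol is a primary alcohol.",
        properties={"logP": 0.5, "QED": 0.4},
    )
    for sample in build_instructions(record):
        print(sample["task"], "|", sample["instruction"], "->", sample["output"])
```

One record yields one sample per single-property task plus one multi-constraint sample, which is how a modest set of aligned triples could be expanded into a larger multi-task instruction set.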

Paper Structure

This paper contains 19 sections, 11 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: (a) An example of our proposed multi-constraint molecule generation task. (b) The response by ChatGPT. (c) The result generated by MolT5. (d) The response generated by the LLaMA3.1 model after applying our proposed property-enhanced instruction tuning, with the results validated by RDKit.
  • Figure 2: Left: Overall PEIT framework. We first pre-train PEIT-GEN and construct instruction data via template filling. Then we fine-tune open-source LLMs through instruction tuning; the resulting PEIT-LLM is used for multi-task molecule generation. Right: The process of PEIT-GEN pre-training; see the PEIT-GEN section for details.
  • Figure 3: The cross-modal causal language modeling.
  • Figure 4: Ablation study on PEIT-GEN pre-training objectives $\mathcal{L}^{sp}_{\text{match}}$, $\mathcal{L}^{st}_{\text{match}}$, $\mathcal{L}^{sp}_{\text{contrastive}}$, and $\mathcal{L}^{st}_{\text{contrastive}}$.
  • Figure 5: The impact of the number of SFT steps on molecule captioning (left) and generation (right).
  • ...and 3 more figures