ChatMol: A Versatile Molecule Designer Based on the Numerically Enhanced Large Language Model

Chuanliu Fan, Ziqiang Cao, Zicheng Ma, Nan Yu, Yimin Peng, Jun Zhang, Yiqin Gao, Guohong Fu

TL;DR

ChatMol reframes de novo molecule design as constrained generation with large language models by introducing a concise, language-like molecule representation and a unified numerical encoding for numeric prompts. It employs a two-stage training pipeline with supervised learning and ranking-based calibration, augmented by numerical embeddings to improve adherence to numeric constraints. Across single-property, substructure-constrained, and multi-property tasks, ChatMol surpasses traditional latent-space and RL-based baselines, achieving strong performance in logP targeting, substructure adherence, and binding affinity optimization ($K_D$) for targets like ESR1 and ACAA1. The results suggest LLMs can serve as flexible, direct generative engines for complex drug-design constraints, with significant gains observed when scaling and numerical enhancement are applied.

Abstract

Goal-oriented de novo molecule design, namely generating molecules with specific property or substructure constraints, is a crucial yet challenging task in drug discovery. Existing methods, such as Bayesian optimization and reinforcement learning, often require training multiple property predictors and struggle to incorporate substructure constraints. Inspired by the success of Large Language Models (LLMs) in text generation, we propose ChatMol, a novel approach that leverages LLMs for molecule design across diverse constraint settings. First, we crafted a molecule representation compatible with LLMs and validated its efficacy across multiple online LLMs. Next, we developed prompts tailored to diverse constrained molecule generation tasks and used them to fine-tune existing LLMs, while integrating feedback learning derived from property prediction. Finally, to address the limitations of LLMs in numerical recognition, we drew on the positional encoding method and incorporated an additional encoding for the numerical values within the prompt. Experimental results across single-property, substructure-property, and multi-property constrained tasks demonstrate that ChatMol consistently outperforms state-of-the-art baselines, including VAE and RL-based methods. Notably, in the multi-objective binding affinity maximization task, ChatMol achieves a significantly lower $K_D$ value of 0.25 for the protein target ESR1 while maintaining the highest overall performance, surpassing previous methods by 4.76%. Meanwhile, with numerical enhancement, the Pearson correlation coefficient between the instructed property values and those of the generated molecules increased by up to 0.49. These findings highlight the potential of LLMs as a versatile framework for molecule generation, offering a promising alternative to traditional latent space and RL-based approaches.
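
The abstract measures constraint adherence with the Pearson correlation coefficient between the property values requested in the prompt and the values of the generated molecules. The sketch below shows how such a coefficient could be computed. The function name `pearson` and the lists `instructed` and `achieved` are hypothetical, and the numbers are made-up illustrative values, not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Illustrative only: target logP values given in prompts, and the logP
# values measured on the molecules a model generated for those prompts.
instructed = [1.0, 2.0, 3.0, 4.0]
achieved = [1.2, 1.9, 3.4, 3.8]
r = pearson(instructed, achieved)
```

A coefficient near 1 would indicate that generated molecules track the requested values closely, which is the sense in which a gain of up to 0.49 reflects better numeric adherence.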

Paper Structure

This paper contains 24 sections, 6 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Examples of diverse conditional generation tasks in drug design addressed by ChatMol: single-objective logP targeting, substructure-constrained logP optimization, and multi-objective binding affinity maximization.
  • Figure 2: Illustration of numerical enhancement in the training process. Property values are transformed through $\mathcal{N}(\cdot)$ to obtain a holistic numerical encoding, which is then added to each numerical token's word embedding to produce the final encoding of the constraint conditions. $\mathcal{E}(\cdot)$ represents the embedding layer.
  • Figure 3: Extremization of logP property with substructure (highlighted in red) fixed. The logP value within the red box represents our target value, and together with the substructure, it constitutes the prompt for this task.
  • Figure 4: Training stages of ChatMol. The yellow squares represent tokens used for calculating the loss, and the white squares represent the condition tokens. During the sequence calibration stage, each training example corresponds to $N$ candidate molecules.
  • Figure 5: Molecules generated in the single-objective logP targeting task.
  • ...and 4 more figures
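
The numerical enhancement described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sinusoidal form chosen for $\mathcal{N}(\cdot)$ is borrowed from standard Transformer positional encoding, and the names `numeric_encoding`, `enhance`, `embed`, and `numeric_positions`, along with the toy zero-valued embedding table, are all hypothetical. Treating the decimal point as a numerical token is also an assumption.

```python
import math

def numeric_encoding(value, dim, base=10000.0):
    """Assumed form of N(.): sinusoidal encoding of a scalar property value."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        enc.append(math.sin(value * freq))
        enc.append(math.cos(value * freq))
    return enc

def enhance(tokens, embed, value, numeric_positions):
    """Add the holistic encoding of `value` to each numerical token's embedding."""
    dim = len(next(iter(embed.values())))
    n_vec = numeric_encoding(value, dim)
    out = []
    for pos, tok in enumerate(tokens):
        e = list(embed[tok])  # E(token): plain word embedding
        if pos in numeric_positions:
            # Only tokens that spell out the number receive the extra encoding.
            e = [a + b for a, b in zip(e, n_vec)]
        out.append(e)
    return out

# Toy prompt "logP = 2.5", split into hypothetical tokens.
tokens = ["logP", "=", "2", ".", "5"]
embed = {tok: [0.0, 0.0, 0.0, 0.0] for tok in tokens}  # stand-in for E(.)
enhanced = enhance(tokens, embed, 2.5, numeric_positions={2, 3, 4})
```

Because every token of the number receives the same encoding of the full value, the model sees the magnitude 2.5 as a whole rather than only the separate digit tokens, which is what the caption calls a holistic numerical encoding.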