Empirical Evidence for the Fragment level Understanding on Drug Molecular Structure of LLMs

Xiuyuan Hu, Guoqing Liu, Yang Zhao, Hao Zhang

TL;DR

This work tackles how SMILES-based transformers internalize chemical spatial structure. By pre-training on SMILES and applying REINFORCE RL fine-tuning for drug design, the authors show that the model increasingly generates high-frequency SMILES substrings that align with 2D molecular fragments, indicating fragment-level understanding rather than mere sequence fitting. The approach is validated on three GuacaMol drug rediscovery tasks, where RL fine-tuning improves top-1 scores and the efficiency of constructing target SMILES from learned substrings. The findings offer insight into the interpretability of chemical language models and support fragment-based perspectives for future drug-design AI systems.

Abstract

AI for drug discovery has been a research hotspot in recent years, and SMILES-based language models have been increasingly applied to drug molecular design. However, no work has explored whether and how language models understand chemical spatial structure from 1D sequences. In this work, we pre-train a transformer model on chemical language, fine-tune it toward drug design objectives, and investigate the correspondence between high-frequency SMILES substrings and molecular fragments. The results indicate that language models can understand chemical structures from the perspective of molecular fragments, and that the structural knowledge learned through fine-tuning is reflected in the high-frequency SMILES substrings generated by the model.
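To make "high-frequency SMILES substrings" concrete, the following is a minimal sketch that counts token n-grams across a small set of SMILES strings. It is plain n-gram counting, used here as a simple stand-in and not the procedure from the paper. The tokenizer regex, the function names, and the thresholds `min_count`, `min_len`, and `max_len` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed, simplified SMILES tokenizer: bracket atoms, two-letter halogens,
# two-digit ring closures, otherwise one character per token.
_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens (illustrative, not exhaustive)."""
    return _TOKEN.findall(smiles)


def frequent_substrings(smiles_list, min_count=2, min_len=2, max_len=6):
    """Return substrings (token n-grams) found in at least min_count molecules."""
    counts = Counter()
    for smiles in smiles_list:
        tokens = tokenize(smiles)
        seen = set()
        for n in range(min_len, max_len + 1):
            for i in range(len(tokens) - n + 1):
                seen.add("".join(tokens[i:i + n]))
        # Count each substring once per molecule, not once per occurrence.
        counts.update(seen)
    return {sub: c for sub, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    generated = ["CC(=O)O", "CC(=O)N", "c1ccccc1Cl"]
    for sub, count in sorted(frequent_substrings(generated).items()):
        print(sub, count)
```

Under these assumptions, a substring such as `C(=O)` is reported because it occurs in two of the three example molecules; whether such a substring corresponds to a chemically meaningful fragment is the question the paper investigates.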

Paper Structure (13 sections, 2 equations, 6 figures, 1 table, 1 algorithm)

Figures (6)

  • Figure 1: Valid ratio and training loss over the course of pre-training on chemical language.
  • Figure 2: Score curves during RL fine-tuning on three drug design tasks.
  • Figure 3: The three target drug structures for the rediscovery tasks.
  • Figure 4: Number of high-frequency fragments extracted by the SPE algorithm over the course of RL fine-tuning on three drug design tasks.
  • Figure 5: Number of high-frequency fragments needed to compose the given SMILES strings over the course of RL fine-tuning on three drug design tasks.
  • ...and 1 more figures
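
One plausible reading of the Figure 5 caption is that a target SMILES string is assembled from a vocabulary of high-frequency substrings and the number of pieces is counted. The sketch below computes the fewest vocabulary pieces whose concatenation equals a target string, using dynamic programming over character positions. The function name `min_fragments`, the example vocabulary, and the character-level matching are assumptions for illustration; the paper's exact counting procedure is not given in this excerpt.

```python
def min_fragments(target, vocabulary):
    """Fewest vocabulary substrings that concatenate to target, or None."""
    inf = float("inf")
    # best[i] = fewest pieces needed to build target[:i]
    best = [0] + [inf] * len(target)
    for end in range(1, len(target) + 1):
        for start in range(end):
            if best[start] != inf and target[start:end] in vocabulary:
                best[end] = min(best[end], best[start] + 1)
    return None if best[-1] == inf else best[-1]


if __name__ == "__main__":
    vocab = {"CC", "(=O)", "O", "C"}  # hypothetical high-frequency substrings
    print(min_fragments("CC(=O)O", vocab))
```

With this hypothetical vocabulary, the target `CC(=O)O` is built from the three pieces `CC`, `(=O)`, and `O`. A vocabulary that contains longer substrings of the target would lower the count.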