Empirical Evidence for the Fragment level Understanding on Drug Molecular Structure of LLMs

Xiuyuan Hu; Guoqing Liu; Yang Zhao; Hao Zhang

TL;DR

This work tackles how SMILES-based transformers internalize chemical spatial structure. By pre-training on SMILES and applying REINFORCE RL fine-tuning for drug design, the authors show that the model increasingly generates high-frequency SMILES substrings that align with 2D molecular fragments, indicating fragment-level understanding rather than mere sequence fitting. The approach is validated on three GuacaMol drug rediscovery tasks, where RL fine-tuning improves top-1 scores and the efficiency of constructing target SMILES from learned substrings. The findings offer insight into the interpretability of chemical language models and support fragment-based perspectives for future drug-design AI systems.

Abstract

AI for drug discovery has been a research hotspot in recent years, and SMILES-based language models have been increasingly applied to drug molecular design. However, no prior work has explored whether and how language models understand chemical spatial structure from 1D sequences. In this work, we pre-train a transformer model on chemical language, fine-tune it toward drug design objectives, and investigate the correspondence between high-frequency SMILES substrings and molecular fragments. The results indicate that language models can understand chemical structures from the perspective of molecular fragments, and that the structural knowledge learned through fine-tuning is reflected in the high-frequency SMILES substrings generated by the model.
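To make the notion of "high-frequency SMILES substrings" concrete, the following is a minimal sketch of a byte-pair-style merge loop over SMILES strings. It is an illustration only, not the authors' implementation: the paper relies on SMILES Pair Encoding (Section 5.1), whereas the function names `tokenize`, `merge_pair`, and `learn_substrings`, the simplified tokenizer, and the `num_merges` and `min_freq` parameters are all assumptions made here.

```python
from collections import Counter


def tokenize(smiles):
    """Split a SMILES string into tokens.

    Simplifying assumption: every character is a token, except the
    two-letter halogens Cl and Br, which are kept together.
    """
    tokens, i = [], 0
    while i < len(smiles):
        if smiles[i:i + 2] in ("Cl", "Br"):
            tokens.append(smiles[i:i + 2])
            i += 2
        else:
            tokens.append(smiles[i])
            i += 1
    return tokens


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_substrings(corpus, num_merges=10, min_freq=2):
    """Greedily merge the most frequent adjacent token pair, up to num_merges times.

    Returns the learned substrings (in merge order) and the re-tokenized corpus.
    """
    seqs = [tokenize(s) for s in corpus]
    vocab = []
    for _ in range(num_merges):
        counts = Counter()
        for seq in seqs:
            counts.update(zip(seq, seq[1:]))  # count adjacent token pairs
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < min_freq:
            break
        vocab.append(pair[0] + pair[1])
        seqs = [merge_pair(seq, pair) for seq in seqs]
    return vocab, seqs


if __name__ == "__main__":
    # Toy corpus: benzene, phenol, toluene, ethanol.
    corpus = ["c1ccccc1", "c1ccccc1O", "Cc1ccccc1", "CCO"]
    vocab, seqs = learn_substrings(corpus, num_merges=5)
    print(vocab)
    print(seqs)
```

Under this sketch, a target SMILES that can be covered by fewer learned tokens is one the substring vocabulary "explains" more efficiently, which is the kind of quantity the abstract relates to molecular fragments.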

Paper Structure (13 sections, 2 equations, 6 figures, 1 table, 1 algorithm)

1 Introduction
2 Related Works
2.1 Language Models for Drug Design
2.2 Fragment-based Drug Design
3 Pre-training on Chemical Language
4 Fine-tuning for Drug Molecular Design
4.1 RL Fine-tuning
4.2 Experiments
5 Language Models' Understanding of Molecular Fragments
5.1 SMILES Pair Encoding
5.2 Model's Learning of SMILES Substrings
5.3 Model's Understanding of Molecular Fragments
6 Conclusion

Figures (6)

Figure 1: Valid ratio and training loss over the course of pre-training on chemical language.
Figure 2: Score curves during RL fine-tuning on three drug design tasks.
Figure 3: Three target drug structures for the rediscovery tasks.
Figure 4: Number of high-frequency fragments extracted by the SPE algorithm over the course of RL fine-tuning on three drug design tasks.
Figure 5: Number of high-frequency fragments needed to compose the given SMILES strings over the course of RL fine-tuning on three drug design tasks.
...and 1 more figure
