Can Large Language Models Understand Molecules?

Shaghayegh Sadeghi; Alan Bui; Ali Forooghi; Jianguo Lu; Alioune Ngom

TL;DR

This study investigates whether large language models (LLMs) can generate meaningful SMILES embeddings for molecular property prediction and drug-drug interaction (DDI) prediction, benchmarking GPT and LLaMA against models pre-trained on SMILES. By embedding SMILES strings from the models’ last layers and evaluating on MoleculeNet and DDI datasets with 5-fold cross-validation, the authors find that LLaMA-based embeddings generally outperform GPT embeddings and are competitive with established SMILES models, particularly for DDI prediction. An ablation analysis shows that LLaMA2 often surpasses LLaMA, benefiting from 40% more training data and a context length of 4096 tokens, though tokenization differences also play a role. The results suggest that LLMs, especially LLaMA variants, can serve as viable, scalable sources of molecular embeddings, motivating further work on fine-tuning, tokenization, and isotropy-aware embedding methods to improve downstream predictive power.
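The 5-fold cross-validation protocol mentioned above can be sketched in a few lines. This is a minimal, standard-library illustration and not the authors' code: the helper names `k_fold_indices` and `cross_validate`, and the `fit` and `score` callables, are hypothetical, and the paper's actual splitting strategy and downstream models may differ.

```python
import random
import statistics


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(features, labels, fit, score, k=5, seed=0):
    """Return (mean, standard deviation) of the score across k folds.

    `fit(X, y)` returns a model and `score(model, X, y)` returns a float.
    Both are hypothetical callables supplied by the caller.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(features), k, seed):
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        scores.append(score(model,
                            [features[i] for i in test_idx],
                            [labels[i] for i in test_idx]))
    return statistics.mean(scores), statistics.stdev(scores)
```

Here `features` would hold one embedding per molecule. Reporting the mean together with the standard deviation across folds matches how the paper's result plots are described (a line for the mean, a shaded band for the deviation).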

Abstract

Purpose: Large Language Models (LLMs) like GPT (Generative Pre-trained Transformer) from OpenAI and LLaMA (Large Language Model Meta AI) from Meta AI are increasingly recognized for their potential in cheminformatics, particularly in understanding the Simplified Molecular Input Line Entry System (SMILES), a standard method for representing chemical structures. These LLMs can also encode SMILES strings into vector representations. Method: We compare GPT and LLaMA with models pre-trained on SMILES at embedding SMILES strings for downstream tasks, focusing on two key applications: molecular property prediction and drug-drug interaction (DDI) prediction. Results: We find that SMILES embeddings generated using LLaMA outperform those from GPT in both molecular property and DDI prediction tasks. Notably, LLaMA-based SMILES embeddings show results comparable to models pre-trained on SMILES in molecular property prediction tasks and outperform the pre-trained models in DDI prediction tasks. Conclusion: The performance of LLMs in generating SMILES embeddings shows great potential for further investigation of these models for molecular embedding. We hope our study bridges the gap between LLMs and molecular embedding, motivating additional research into the potential of LLMs in the molecular representation field. GitHub: https://github.com/sshaghayeghs/LLaMA-VS-GPT
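To make "vector representations" concrete, the sketch below turns per-token last-layer vectors into one fixed-size molecule embedding and then computes the average pairwise cosine similarity, a commonly used rough indicator of anisotropy. It rests on assumptions: mean pooling is one common choice and is not confirmed here as the authors' method, the token vectors are stand-ins for real model outputs, and the function names are hypothetical.

```python
import math
from itertools import combinations


def mean_pool(token_vectors):
    """Average per-token vectors (one list of floats per token) into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    n_tokens = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def average_cosine_similarity(embeddings):
    """Mean cosine similarity over all unordered pairs of embeddings.

    Values near 1 for unrelated molecules hint that the embeddings occupy
    a narrow cone of the space (anisotropy). Needs at least two embeddings.
    """
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

In practice, `token_vectors` would come from the last hidden layer of the chosen model for one tokenized SMILES string, and `average_cosine_similarity` would be run over the pooled embeddings of many molecules.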

Paper Structure (12 sections, 8 figures, 6 tables)

Introduction
Related Work
LLMs
Experiments
Experimental Setup
Benchmarking Data Sets
Performance Analysis
Results on Classification Tasks
Results on Regression Tasks
Results on Link Prediction Tasks
Ablation Study
Conclusions

Figures (8)

Figure 1: Drug Chemical Representations.
Figure 2: Results on Classification and Regression Tasks. Each Line Represents the Mean Value of 5-Fold Cross-Validation, While the Shaded Area Shows the Standard Deviation.
Figure 3: Comparison of LLaMA and LLaMA2 Performance
Figure 4: Effect of Dimension Reduction on The Performance of LLMs
Figure 5: Anisotropy Problem of LLM Embeddings
...and 3 more figures
