Improving Chemical Understanding of LLMs via SMILES Parsing

Yunhui Jang, Jaehyung Kim, Sungsoo Ahn

TL;DR

The paper tackles the challenge that LLMs struggle to interpret SMILES representations, which hampers accurate molecular understanding. It introduces CLEANMOL, a framework that defines five deterministic SMILES parsing tasks—spanning subgraph and global graph information—together with a 250K-molecule dataset annotated automatically via RDKit. A two-stage training pipeline combines task-adaptive data pruning and curriculum learning to pretrain on these parsing tasks and then fine-tune on downstream chemistry tasks, yielding improvements on Mol-Instructions benchmarks and molecular generation. The results demonstrate that explicit, structure-focused supervision can transfer to generation and other downstream tasks, offering a scalable path toward more structurally grounded molecular LLMs with potential impact on drug discovery and materials design.

Abstract

Large language models (LLMs) are increasingly recognized as powerful tools for scientific discovery, particularly in molecular science. A fundamental requirement for these models is the ability to accurately understand molecular structures, commonly encoded in the SMILES representation. However, current LLMs struggle to interpret SMILES, even failing to carry out basic tasks such as counting molecular rings. To address this limitation, we introduce CLEANMOL, a novel framework that formulates SMILES parsing into a suite of clean and deterministic tasks explicitly designed to promote graph-level molecular comprehension. These tasks range from subgraph matching to global graph matching, providing structured supervision aligned with molecular structural properties. We construct a molecular pretraining dataset with adaptive difficulty scoring and pre-train open-source LLMs on these tasks. Our results show that CLEANMOL not only enhances structural comprehension but also achieves the best results or remains competitive with the baselines on the Mol-Instructions benchmark.
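
To make the ring-counting task mentioned above concrete, here is a minimal sketch of why its label is deterministic. It is not the authors' annotation code (the TL;DR states that labels come from RDKit); it is a standard-library approximation that counts ring-closure pairs in a SMILES string. The function name `count_ring_closures` is hypothetical, and the sketch assumes a syntactically valid SMILES that does not use the extended `%(nnn)` ring-label form.

```python
def count_ring_closures(smiles: str) -> int:
    """Count ring-closure pairs in a SMILES string.

    Illustrative only: each matched pair of ring-closure labels adds one
    independent cycle to the molecular graph. This is NOT the RDKit-based
    annotation described in the source.
    """
    open_labels = set()  # ring labels that have been opened but not closed
    rings = 0
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Digits inside a bracket atom are isotopes, H counts, or
            # charges, not ring labels, so skip the whole bracket.
            i = smiles.index("]", i) + 1
            continue
        if ch == "%":
            # Two-digit ring label such as %12.
            label = smiles[i + 1:i + 3]
            i += 3
        elif ch.isdigit():
            label = ch
            i += 1
        else:
            i += 1
            continue
        if label in open_labels:
            # Second occurrence closes the ring.
            open_labels.remove(label)
            rings += 1
        else:
            open_labels.add(label)
    return rings


if __name__ == "__main__":
    for smi in ["CCO", "c1ccccc1", "c1ccc2ccccc2c1", "[13CH3]C1CC1"]:
        print(smi, count_ring_closures(smi))
```

The point of the sketch is that the answer depends only on the string, so a model's prediction can be checked exactly. A cheminformatics toolkit remains the safer choice for real annotation, since it also validates the SMILES and handles cases this sketch ignores.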

Paper Structure

This paper contains 60 sections, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Overview of SMILES parsing. (a) Each column visualizes one of the five SMILES parsing tasks: functional group matching, ring counting, carbon chain length measurement, SMILES canonicalization, and fragment assembly. The highlighted tokens in the SMILES correspond to the substructures involved in each task. (b) Recent LLMs fail at SMILES parsing, while the model trained with CLEANMOL shows improvement.
  • Figure 2: Complex cases in SMILES parsing. The top green panels represent relatively simple cases, while the bottom red panels illustrate more complex examples with non-continuous substructures in SMILES. Orange and teal highlights correspond to tasks involving ring counting and functional group matching, respectively.
  • Figure 3: Examples from the CLEANMOL dataset.
  • Figure 4: Overview of molecular data pruning and ranking. Each number represents the task-specific difficulty score assigned to a molecule, as defined in the paper's difficulty-score table. For each parsing task, molecules are ranked by these scores and the mid-difficulty samples are selected.
  • Figure 5: Data scale analysis for SMILES parsing.
  • ...and 6 more figures
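
The pruning step described in the Figure 4 caption (rank molecules by a task-specific difficulty score, then keep the mid-difficulty samples) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function name `select_mid_difficulty`, the `keep_fraction` parameter, and the example scores are all hypothetical, and the page does not state how wide the retained band actually is.

```python
def select_mid_difficulty(scores: dict[str, float],
                          keep_fraction: float = 0.5) -> list[str]:
    """Return the middle band of molecules when ranked by difficulty.

    `scores` maps a SMILES string to a task-specific difficulty score.
    `keep_fraction` is an assumed knob for the width of the kept band;
    the source does not specify the value used.
    """
    if not scores:
        return []
    # Rank easiest to hardest; break ties by SMILES so the result is
    # deterministic.
    ranked = sorted(scores, key=lambda smi: (scores[smi], smi))
    n_keep = max(1, int(len(ranked) * keep_fraction))
    # Center the retained window in the ranking, dropping both the
    # easiest and the hardest samples.
    start = (len(ranked) - n_keep) // 2
    return ranked[start:start + n_keep]


if __name__ == "__main__":
    # Made-up scores for illustration only.
    toy_scores = {"C": 1.0, "CC": 2.0, "CCC": 3.0, "CCCC": 4.0}
    print(select_mid_difficulty(toy_scores))
```

Because the ranking is computed per parsing task, the same molecule may be kept for one task and pruned for another.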