Structural Reasoning Improves Molecular Understanding of LLM

Yunhui Jang; Jaehyung Kim; Sungsoo Ahn

TL;DR

This work identifies a persistent gap in LLMs’ ability to reason about molecular structure and demonstrates that explicit structural reasoning is essential for accurate molecular understanding. It introduces Molecular Structural Reasoning (MSR), a two-stage framework with a reasoning module and an answering module. MSR covers both analytic scenarios (the molecule is given) and synthetic scenarios (the molecule must be generated), and in the analytic case it relies on external tools such as RDKit for deterministic structure extraction. Across molecule-to-text, retrosynthesis, and text-to-molecule tasks, MSR yields consistent improvements for both chemical and general LLMs, with gains in description quality, synthesis accuracy, and generation fidelity, including state-of-the-art performance in several settings. The results underscore the value of explicit, componentized structural reasoning in domain-specific LLMs. The work also reports ablations and limitations, such as partial difficulties with certain structural elements and descriptor interactions, that guide future improvements and support reproducibility.

Abstract

Recently, large language models (LLMs) have shown significant progress, approaching human perception levels. In this work, we demonstrate that despite these advances, LLMs still struggle to reason using molecular structural information. This gap is critical because many molecular properties depend heavily on structural details such as functional groups. To address this limitation, we propose an approach that sketches molecular structures for reasoning. Specifically, we introduce the Molecular Structural Reasoning (MSR) framework, which enhances the molecular understanding of LLMs by explicitly incorporating key structural features. We present two frameworks, one for scenarios where the target molecule is known and one for scenarios where it is unknown. Extensive experiments verify that MSR improves molecular understanding.
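
As a rough illustration of "explicitly incorporating key structural features" in the setting where the molecule is known, the sketch below renders six structural features as plain text and places them in front of a task prompt. The field names, labels, prompt wording, and the helpers `render_structure` and `build_answering_prompt` are all hypothetical and are not the authors' templates. The feature values are written by hand here; the paper obtains them from an external tool such as RDKit.

```python
# Hypothetical (key, label) pairs for the six structural features; the
# wording is an assumption for illustration, not the paper's template.
STRUCTURE_FIELDS = [
    ("molecular_formula", "Molecular formula"),
    ("longest_carbon_chain", "Longest carbon chain length"),
    ("aromatic_rings", "Number of aromatic rings"),
    ("ring_compounds", "Ring compounds"),
    ("functional_groups", "Functional groups"),
    ("chiral_centers", "Number of chiral centers"),
]


def render_structure(features):
    """Render a feature dict as one 'Label: value' line per known field."""
    lines = []
    for key, label in STRUCTURE_FIELDS:
        if key not in features:
            continue  # skip features that were not extracted
        value = features[key]
        if isinstance(value, (list, tuple)):
            # Lists are joined with commas; an empty list is shown as "none".
            value = ", ".join(str(v) for v in value) if value else "none"
        lines.append(f"{label}: {value}")
    return "\n".join(lines)


def build_answering_prompt(smiles, features, task):
    """Combine the molecule, its structural description, and the task."""
    return (
        f"Molecule (SMILES): {smiles}\n"
        f"Structural information:\n{render_structure(features)}\n"
        f"Task: {task}"
    )


# Hand-written features for ethanol (SMILES "CCO").
example_features = {
    "molecular_formula": "C2H6O",
    "longest_carbon_chain": 2,
    "aromatic_rings": 0,
    "ring_compounds": [],
    "functional_groups": ["hydroxyl"],
    "chiral_centers": 0,
}
prompt = build_answering_prompt("CCO", example_features, "Describe this molecule.")
print(prompt)
```

Keeping the structural text separate from the task text mirrors the split between a reasoning step and an answering step, so the same rendered block could be reused across tasks.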

Paper Structure (57 sections, 18 figures, 14 tables)

Introduction
Recent large language models do not understand structural information
Molecular formula.
Longest carbon chain.
Aromatic rings.
Ring compounds.
Functional groups.
Chiral centers.
$\textsc{MSR}$: Molecular Structural Reasoning
Overview of $\textsc{MSR}$
Analytic reasoning
Reasoning module.
Answering module.
Synthetic reasoning
Reasoning module.
...and 42 more sections

Figures (18)

Figure 1: Overview of LLMs with structural information. (\ref{fig: failure_case}) Each color in $\textsc{MSR}$ represents a structural component. The top molecule is incorrectly generated using only the description, while the bottom molecule is correctly generated by combining the description with $\textsc{MSR}$. (\ref{fig: analysis}) Despite the importance of structural information, even recent LLMs struggle to accurately infer key structural details from molecular representations such as SMILES (Molecule-to-structure; M2S) or from given descriptions (Text-to-structure; T2S).
Figure 2: The six key types of structural information: molecular formula, longest carbon chain length, aromatic rings, ring compounds, functional groups, and chirality. Matching colors link each type of structural information to the corresponding component of the molecule.
Figure 3: Illustration of the importance of structural information. This example shows how replacing each piece of structural information (dashed box) alters the molecule. Colors correspond to the structural elements in \ref{fig: cot_structure}.
Figure 4: Overview of the $\textsc{MSR}$ fine-tuning framework. Analytic reasoning applies when the input molecule is available, while synthetic reasoning applies when it is not. Light gray boxes denote molecules (SMILES); gray boxes denote the related descriptions; colored boxes represent $\textsc{MSR}$. The yellow and red modules perform reasoning and answering, respectively. In (\ref{fig: train_given_mol}), the yellow module is an external tool. In (\ref{fig: train_wo_mol}), colors indicate $\textsc{MSR}$ and the corresponding structural elements; here, the third molecule is selected by matching ratio-based rejection sampling because it has the highest matching ratio (3/3).
Figure 5: An example of generated samples for molecule-to-text. $\textsc{MSR}$ improves the accuracy of detailed molecular information (highlighted in yellow). More examples are provided in \ref{appx: sample}.
...and 13 more figures
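
The Figure 4 caption describes picking the candidate molecule with the highest matching ratio (3/3 in its example). The sketch below shows one way such a selection could work, assuming the matching ratio is the fraction of predicted structural elements that a candidate reproduces. That definition, the function names `matching_ratio` and `select_by_matching_ratio`, and the hand-written candidate features are assumptions for illustration; in practice the candidate features would be computed from each generated molecule by a structure-extraction tool.

```python
def matching_ratio(predicted, actual):
    """Fraction of predicted structural elements that the candidate matches."""
    if not predicted:
        return 0.0
    matched = sum(
        1 for key, value in predicted.items() if actual.get(key) == value
    )
    return matched / len(predicted)


def select_by_matching_ratio(predicted, candidates):
    """Return (smiles, ratio) for the best candidate; ties keep the earliest.

    `candidates` is a list of (smiles, features) pairs.
    """
    best_smiles, best_ratio = None, -1.0
    for smiles, features in candidates:
        ratio = matching_ratio(predicted, features)
        if ratio > best_ratio:
            best_smiles, best_ratio = smiles, ratio
    return best_smiles, best_ratio


# Structure predicted by the reasoning step (illustrative values).
predicted = {
    "aromatic_rings": 1,
    "functional_groups": frozenset({"hydroxyl"}),
    "chiral_centers": 0,
}

# Three sampled candidates with hand-written features for illustration.
candidates = [
    ("C1CCCCC1", {"aromatic_rings": 0, "functional_groups": frozenset(), "chiral_centers": 0}),
    ("Cc1ccccc1", {"aromatic_rings": 1, "functional_groups": frozenset(), "chiral_centers": 0}),
    ("Oc1ccccc1", {"aromatic_rings": 1, "functional_groups": frozenset({"hydroxyl"}), "chiral_centers": 0}),
]

best, ratio = select_by_matching_ratio(predicted, candidates)
print(best, ratio)
```

Here the three candidates match 1, 2, and 3 of the 3 predicted elements, so the third candidate is selected, as in the caption's example.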
