Structural Reasoning Improves Molecular Understanding of LLM

Yunhui Jang, Jaehyung Kim, Sungsoo Ahn

TL;DR

This work identifies a persistent gap in LLMs’ ability to reason about molecular structure and demonstrates that explicit structural reasoning is essential for accurate molecular understanding. It introduces Molecular Structural Reasoning (MSR), a two-stage framework with a reasoning module and an answering module that handles analytic and synthetic scenarios, leveraging external tools like RDKit for deterministic structure extraction. Across molecule-to-text, retrosynthesis, and text-to-molecule tasks, MSR yields consistent improvements for both chemical and general LLMs, with notable gains in description quality, synthesis accuracy, and generation fidelity, including state-of-the-art performance in several settings. The results underscore the value of explicit, componentized structural reasoning in domain-specific LLMs. The ablations and reported limitations, such as partial difficulties with certain structural elements and descriptor interactions, point to directions for future improvement.

Abstract

Recently, large language models (LLMs) have shown significant progress, approaching human perception levels. In this work, we demonstrate that despite these advances, LLMs still struggle to reason using molecular structural information. This gap is critical because many molecular properties, including functional groups, depend heavily on such structural details. To address this limitation, we propose an approach that sketches molecular structures for reasoning. Specifically, we introduce the Molecular Structural Reasoning (MSR) framework, which enhances the molecular understanding of LLMs by explicitly incorporating key structural features. We present two frameworks, one for scenarios where the target molecule is known and one for scenarios where it is unknown. We verify through extensive experiments that MSR improves molecular understanding.

Paper Structure (57 sections, 18 figures, 14 tables)

Figures (18)

  • Figure 1: Overview of LLMs with structural information. (Failure case) Each color in MSR represents a structural component. The top molecule is incorrectly generated using only the description, while the bottom is correctly generated by incorporating the description and MSR. (Analysis) Despite the importance of structural information, even recent LLMs struggle to accurately infer key structural details from molecular representations such as SMILES (Molecule-to-structure; M2S) or given descriptions (Text-to-structure; T2S).
  • Figure 2: The six key types of structural information: molecular formula, longest carbon chain length, aromatic rings, ring compounds, functional groups, and chirality. Matching colors link each type of structural information to the corresponding component of the molecule.
  • Figure 3: Illustration of the importance of structural information. This example shows how replacing each piece of structural information (dashed box) alters the molecule. Colors correspond to the structural elements in Figure 2.
  • Figure 4: Overview of the MSR fine-tuning framework. Analytic reasoning applies when the input molecule is available, while synthetic reasoning applies when it is not. Light gray boxes denote the molecules (SMILES); gray boxes denote the related descriptions; colored boxes represent MSR. The yellow and red modules perform reasoning and answering, respectively. In the analytic case (molecule given), the yellow module is an external tool. In the synthetic case (molecule not given), colors indicate MSR and the corresponding structural elements; here, the third molecule is selected by matching-ratio-based rejection sampling because it has the highest matching ratio (3/3).
  • Figure 5: An example of generated samples for molecule-to-text. We observe that MSR improves the accuracy of detailed molecular information (highlighted in yellow). More examples are provided in the paper's appendix.
  • ...and 13 more figures
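
The matching-ratio-based rejection sampling mentioned in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the feature keys, and the definition of the matching ratio (the fraction of reasoned structural elements that a candidate reproduces exactly) are assumptions made for illustration. The per-candidate features are written by hand here, whereas the summary states that an external tool such as RDKit extracts them.

```python
from fractions import Fraction

def matching_ratio(reasoned, extracted):
    """Fraction of reasoned structural elements reproduced by a candidate.

    Assumed definition: an element matches only if the values are equal.
    """
    if not reasoned:
        return Fraction(0)
    matched = sum(1 for key, value in reasoned.items()
                  if extracted.get(key) == value)
    return Fraction(matched, len(reasoned))

def select_by_matching_ratio(reasoned, candidates):
    """Return (smiles, ratio) of the best candidate; ties keep the earliest."""
    best_smiles, best_ratio = None, Fraction(-1)
    for smiles, extracted in candidates:
        ratio = matching_ratio(reasoned, extracted)
        if ratio > best_ratio:
            best_smiles, best_ratio = smiles, ratio
    return best_smiles, best_ratio

# Hypothetical structural reasoning produced from a text description.
reasoned = {
    "aromatic_rings": 1,
    "functional_groups": frozenset({"hydroxyl"}),
    "chirality": False,
}

# Hypothetical sampled candidates with hand-written structural features.
candidates = [
    ("CCCC", {"aromatic_rings": 0,
              "functional_groups": frozenset(),
              "chirality": False}),
    ("c1ccccc1", {"aromatic_rings": 1,
                  "functional_groups": frozenset(),
                  "chirality": False}),
    ("Oc1ccccc1", {"aromatic_rings": 1,
                   "functional_groups": frozenset({"hydroxyl"}),
                   "chirality": False}),
]

best_smiles, best_ratio = select_by_matching_ratio(reasoned, candidates)
print(best_smiles, best_ratio)
```

As in the caption's example, the third candidate matches all three reasoned elements (3/3) and is the one kept. A real pipeline would also need a rule for rejecting every candidate when none reaches an acceptable ratio, which this sketch leaves out.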