SpectraLLM: Uncovering the Ability of LLMs for Molecule Structure Elucidation from Multi-Spectral

Yunyue Su, Jiahui Chen, Zao Jiang, Zhenyi Zhong, Liang Wang, Qiang Liu, Zhaoxiang Zhang

Abstract

Automated molecular structure elucidation remains challenging, as existing approaches often depend on pre-compiled databases or restrict themselves to single spectroscopic modalities. Here we introduce SpectraLLM, a large language model that performs end-to-end structure prediction by reasoning over one or multiple spectra. Unlike conventional spectrum-to-structure pipelines, SpectraLLM represents both continuous (IR, Raman, UV-Vis, NMR) and discrete (MS) modalities in a shared language space, enabling it to capture substructural patterns that are complementary across different spectral types. We pretrain and fine-tune the model on small-molecule domains and evaluate it on four public benchmark datasets. SpectraLLM achieves state-of-the-art performance, substantially surpassing single-modality baselines. Moreover, it demonstrates strong robustness in unimodal settings and further improves prediction accuracy when jointly reasoning over diverse spectra, establishing a scalable paradigm for language-based spectroscopic analysis. Code is available at https://github.com/OPilgrim/SpectraLLM.

Paper Structure

This paper contains 33 sections, 8 equations, 6 figures, and 30 tables.

Figures (6)

  • Figure 1: Overview of the training pipeline for structure elucidation. Characteristic spectral peaks are extracted from raw IR, Raman, UV-Vis, NMR, or MS data and used to construct natural language prompts. These prompts are fed to a large language model whose base weights stay frozen while LoRA adapters are fine-tuned. The model is trained to autoregressively generate molecular structures in SMILES format, supervised by the ground-truth sequence.
  • Figure 2: Case studies illustrating complementary roles of Raman and IR spectra in joint structure elucidation. Top: Three representative examples where incorporating Raman spectra corrects mispredictions made using only IR or UV-Vis inputs, reflecting Raman’s sensitivity to polarizability-dependent substructures. Bottom: Three representative examples where IR spectra are indispensable for identifying carbonyl groups and resolving branched chain configurations; without IR input, predictions based on Raman and UV-Vis remain ambiguous or incorrect.
  • Figure 3: Molecular Property Distributions of the Dataset. This panel of three histograms summarizes critical molecular properties of the dataset: Molecular Weight (MW) Distribution (Left): Shows the frequency of molecules across a weight range (0–2500 Da), with a concentration around 500 Da (consistent with typical small-molecule datasets). Heavy Atom Count (HAC) Distribution (Middle): Characterizes the number of non-hydrogen atoms per molecule, peaking at 25 heavy atoms (reflecting moderate molecular complexity). LogP Distribution (Right): Depicts the octanol-water partition coefficient (a measure of lipophilicity), with a sharp peak near 0 (indicating a skew toward moderately hydrophilic molecules).
  • Figure 4: Functional Group Distribution in the Dataset. This bar plot quantifies the prevalence of key functional groups across the 943,729 molecules in our training dataset. The y-axis represents the number of molecules containing each functional group, with the corresponding percentage (relative to the total dataset size) labeled atop each bar.
  • Figure 5: Qualitative counterfactual examples obtained by deleting diagnostic MS fragment peaks. Left group: Ground-truth molecule CC(C)OC(=O)C(C)C (left) and SpectraLLM prediction CCCCCC=O (right) after removing the ester-diagnostic m/z = 43 isopropyl cation peak from the input spectrum. The model preserves a terminal carbonyl but no longer predicts an ester or branched isopropyl substituent. Right group: Ground truth CN(C)C(C)=O (left) and prediction [N-]=[N+]=NC[C@@H]1CN1 (right) after removing the m/z = 43 [CH3CO]^+ acetyl fragment. The predicted structure matches the overall mass but eliminates oxygen-containing fragments consistent with m/z = 43.
  • ...and 1 more figures
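
The captions for Figure 1 and Figure 5 describe two operations that are easy to picture in code: turning extracted peaks into a natural language prompt, and deleting a diagnostic MS peak to build a counterfactual input. The following is a minimal sketch of both under stated assumptions, not the authors' implementation. The helper names (`top_peaks`, `build_prompt`, `drop_peak`), the `UNITS` table, the prompt wording, the choice of keeping the most intense peaks, and the matching tolerance are all hypothetical; the peak values in the example are made up for illustration and are not taken from the paper's data.

```python
from typing import Dict, List, Tuple

# A peak is (position, relative intensity). Positions are in the modality's
# own unit: wavenumber, wavelength, chemical shift, or m/z.
Peak = Tuple[float, float]

# Hypothetical unit labels used only to make the prompt readable.
UNITS = {"IR": "cm^-1", "Raman": "cm^-1", "UV-Vis": "nm", "NMR": "ppm", "MS": "m/z"}


def top_peaks(peaks: List[Peak], k: int = 5) -> List[Peak]:
    """Keep the k most intense peaks, returned in order of increasing position."""
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:k]
    return sorted(strongest, key=lambda p: p[0])


def build_prompt(spectra: Dict[str, List[Peak]], k: int = 5) -> str:
    """Render one or more spectra as a text prompt that ends by asking for SMILES."""
    lines = ["Predict the molecular structure as a SMILES string from these spectra."]
    for modality, peaks in spectra.items():
        unit = UNITS.get(modality, "")
        listed = ", ".join(
            f"{pos:g} ({inten:.2f})" for pos, inten in top_peaks(peaks, k)
        )
        lines.append(f"{modality} peaks [{unit}]: {listed}")
    lines.append("SMILES:")
    return "\n".join(lines)


def drop_peak(peaks: List[Peak], position: float, tol: float = 0.5) -> List[Peak]:
    """Counterfactual edit: remove every peak within tol of the given position."""
    return [p for p in peaks if abs(p[0] - position) > tol]


if __name__ == "__main__":
    # Made-up peak lists, for illustration only.
    spectra = {
        "IR": [(1735.0, 0.95), (2980.0, 0.40), (1180.0, 0.70)],
        "MS": [(43.0, 1.00), (71.0, 0.60), (89.0, 0.30)],
    }
    print(build_prompt(spectra, k=3))

    # Same prompt after deleting the m/z = 43 fragment, as in the Figure 5 setup.
    spectra["MS"] = drop_peak(spectra["MS"], 43.0)
    print(build_prompt(spectra, k=3))
```

Comparing the model's output on the two prompts is one way to probe whether a prediction depends on a specific fragment peak. Keeping each modality on its own labeled line also makes unimodal and multimodal prompts share a single format.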