MolSpectLLM: A Molecular Foundation Model Bridging Spectroscopy, Molecule Elucidation, and 3D Structure Generation

Shuaike Shen, Jiaqing Xie, Zhuo Yang, Antong Zhang, Shuzhou Sun, Ben Gao, Tianfan Fu, Biqing Qi, Yuqiang Li

TL;DR

MolSpectLLM addresses the SMILES-centric limitation of molecular foundation models by unifying experimental spectroscopy with 3D structure generation. Built on a 7B decoder-only LM (Qwen2.5-7B), it uses standardized textual descriptions of $^{1}$H/$^{13}$C NMR, IR, and MS to enable spectrum-oriented reasoning and integrates 3D coordinate generation from textual inputs or spectra. It achieves state-of-the-art performance on spectrum-related tasks (spectra-to-SMILES, SMILES-to-Spectra) and strong results on name conversion, Molecule QA, and, notably, direct 3D structure generation with high validity and geometry fidelity. This multimodal, spectroscopy-informed foundation model bridges spectral analysis, molecular elucidation, and 3D design, with practical implications for drug discovery and materials science, while illustrating effective use of three-phase training and LoRA-based instruction following.

Abstract

Recent advances in molecular foundation models have shown impressive performance in molecular property prediction and de novo molecular design, with promising applications in areas such as drug discovery and reaction prediction. Nevertheless, most existing approaches rely exclusively on SMILES representations and overlook both experimental spectra and 3D structural information, two indispensable sources for capturing molecular behavior in real-world scenarios. This limitation reduces their effectiveness in tasks where stereochemistry, spatial conformation, and experimental validation are critical. To overcome these challenges, we propose MolSpectLLM, a molecular foundation model pretrained on Qwen2.5-7B that unifies experimental spectroscopy with molecular 3D structure. By explicitly modeling molecular spectra, MolSpectLLM achieves state-of-the-art performance on spectrum-related tasks, with an average accuracy of 0.53 across NMR, IR, and MS benchmarks. MolSpectLLM also shows strong performance on the spectra analysis task, obtaining 15.5% sequence accuracy and 41.7% token accuracy on Spectra-to-SMILES, substantially outperforming large general-purpose LLMs. More importantly, MolSpectLLM not only achieves strong performance on molecular elucidation tasks, but also generates accurate 3D molecular structures directly from SMILES or spectral inputs, bridging spectral analysis, molecular elucidation, and molecular design. Code is available at https://github.com/Eurekashen/MolSpectLLM.
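To make the two Spectra-to-SMILES metrics quoted above concrete, here is a minimal sketch of how sequence accuracy and token accuracy could be computed. It rests on assumptions that do not come from the paper: tokens are taken to be single characters of the SMILES string, positions are compared in order, and the longer string sets the denominator. The paper's exact tokenization and normalization may differ, and the function names are hypothetical.

```python
from typing import Sequence


def token_accuracy(pred: str, ref: str) -> float:
    """Position-wise character match rate between two SMILES strings.

    Assumption: one character = one token; the denominator is the length
    of the longer string, so missing or extra characters count as errors.
    """
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0  # two empty strings agree trivially
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / longest


def sequence_accuracy(preds: Sequence[str], refs: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference exactly."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if not preds:
        return 0.0
    return sum(p == r for p, r in zip(preds, refs)) / len(preds)


def mean_token_accuracy(preds: Sequence[str], refs: Sequence[str]) -> float:
    """Average of token_accuracy over all prediction/reference pairs."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if not preds:
        return 0.0
    return sum(token_accuracy(p, r) for p, r in zip(preds, refs)) / len(preds)


if __name__ == "__main__":
    # Toy data, not taken from the paper: ethanol (exact match) and
    # benzene predicted where pyridine was expected (one wrong character).
    predictions = ["CCO", "c1ccccc1"]
    references = ["CCO", "c1ccncc1"]
    print(sequence_accuracy(predictions, references))
    print(mean_token_accuracy(predictions, references))
```

Under these definitions a prediction can score high on token accuracy while still failing the exact-match test, which is one plausible reading of why the reported token accuracy sits well above the sequence accuracy.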

Paper Structure

This paper contains 64 sections, 12 equations, 12 figures, 4 tables, 7 algorithms.

Figures (12)

  • Figure 1: Pipeline of MolSpectLLM. The training of MolSpectLLM consists of three stages. During pretraining, we leverage publicly available chemical literature and unified textual descriptions constructed from PubChem, QM9S, and Multi-modal Spectrum. Then we perform instruction tuning on curated instruction datasets, followed by lightweight LoRA adaptation on a small set of template-based data to facilitate evaluation. Further details are provided in Section \ref{sec:training} and Appendix \ref{app:spec_data_process}.
  • Figure 2: Standard textual description for different spectrum types. Instead of using raw spectral vectors, we design spectrum-specific feature extraction pipelines and convert the results into structured textual formats for LLM consumption. Details of the data processing are described in Sec. \ref{sec:spectra_token} and Appendix \ref{app:spec_data_process}.
  • Figure 3: Example of SMILES-to-3D. MolSpectLLM is able to generate an accurate 3D structure from a given SMILES string.
  • Figure 4: Results on the Spectra-to-SMILES task with evaluation metrics including token accuracy, sequence accuracy, FP similarity, and structural similarity.
  • Figure 5: Example of Spectra-to-SMILES. MolSpectLLM infers the corresponding molecular SMILES from multiple given spectra.
  • ...and 7 more figures