Generative structural elucidation from mass spectra as an iterative optimization problem

Mrunali Manjrekar; Runzhong Wang; Samuel Goldman; Jenna C. Fromer; Connor W. Coley

TL;DR

FOAM reframes structure elucidation from LC-MS/MS as an iterative, formula-constrained optimization problem, leveraging a graph genetic algorithm and a spectrum predictor to search chemically feasible annotations beyond fixed libraries. By maximizing spectral similarity while penalizing structural complexity via NSGA-II-based Pareto ranking, FOAM generates and refines candidate structures across generations to recover true molecules or closely related decoys. Evaluations on NIST'20 and MassSpecGym show FOAM can encounter the true structure in a substantial fraction of runs and significantly boost top-10 candidate quality when combined with existing elucidation methods, with success strongly tied to seed relevance and the accuracy of the spectral oracle. The framework is modular and extensible, enabling integration of additional context signals (e.g., retention time, biosynthetic feasibility) and uncertainty-aware selection to further improve de novo structure elucidation workflows.
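
The Pareto ranking mentioned above can be illustrated with a minimal sketch of non-dominated sorting over two objectives: spectral similarity (maximized) and SAScore (minimized). This is not FOAM's implementation; the candidate labels, the scores, and the helper names `dominates` and `pareto_fronts` are hypothetical, and NSGA-II's crowding-distance tie-breaking is omitted.

```python
from typing import List, Tuple

# (label, spectral_similarity, sa_score); all values are made up for illustration.
Candidate = Tuple[str, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one.

    Spectral similarity is maximized; SAScore (complexity) is minimized.
    """
    no_worse = a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_fronts(candidates: List[Candidate]) -> List[List[Candidate]]:
    """Peel off successive non-dominated fronts (front 0 is the best rank)."""
    remaining = list(candidates)
    fronts = []
    while remaining:
        front = [
            c for c in remaining
            if not any(dominates(other, c) for other in remaining)
        ]
        fronts.append(front)
        remaining = [c for c in remaining if c not in front]
    return fronts


candidates = [
    ("A", 0.90, 4.0),  # highest similarity, fairly complex
    ("B", 0.80, 2.5),  # slightly lower similarity, simplest
    ("C", 0.70, 3.0),  # dominated by B
    ("D", 0.60, 5.0),  # dominated by A, B, and C
]
fronts = pareto_fronts(candidates)
for rank, front in enumerate(fronts):
    print(rank, [label for label, _, _ in front])
```

In this toy population, A and B trade off against each other and share the first front, while C and D fall into later fronts; a selection step would fill the mating pool front by front.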

Abstract

Liquid chromatography tandem mass spectrometry (LC-MS/MS) is a critical analytical technique for molecular identification across metabolomics, environmental chemistry, and chemical forensics. A variety of computational methods have emerged for structural annotation of spectral features of interest, but many of these features cannot be confidently annotated with reference structures or spectra. Here, we introduce FOAM (Formula-constrained Optimization for Annotating Metabolites), a computational workflow that poses structure elucidation from LC-MS/MS as an iterative optimization problem. FOAM couples a formula-constrained graph genetic algorithm with spectral simulation to explore candidate annotations given an experimental spectrum. We demonstrate FOAM's performance on the NIST'20 and MassSpecGym datasets as both a standalone elucidation pipeline and as a complement to existing inverse models. This work establishes iterative optimization as an effective and extensible paradigm for structural elucidation.
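
To make the formula constraint concrete, the sketch below checks whether a candidate's molecular formula matches a target formula by comparing element counts. It is an illustrative assumption rather than FOAM's code: `parse_formula` and `same_formula` are hypothetical helpers, and the parser handles only flat formulas such as `C6H12O6` (no parentheses, charges, or isotope labels).

```python
import re
from collections import Counter


def parse_formula(formula: str) -> Counter:
    """Count atoms in a flat molecular formula string such as 'C6H12O6'."""
    counts: Counter = Counter()
    # Each match is an element symbol followed by an optional integer count.
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def same_formula(candidate: str, target: str) -> bool:
    """True if both formulas contain exactly the same atoms."""
    return parse_formula(candidate) == parse_formula(target)


# Element order does not matter, only the counts do.
print(same_formula("C2H6O", "C2OH6"))
# A candidate that lost two hydrogens violates the constraint.
print(same_formula("C2H6O", "C2H4O"))
```

A formula-constrained search would keep only offspring that pass a check of this kind, or construct edits that preserve the element counts by design.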

Paper Structure (16 sections, 1 equation, 10 figures, 3 tables, 1 algorithm)

FOAM implementation details
Seeding cache
Crossover and mutation operations
Extended benchmarking details
NIST'20 evaluation
ICEBERG training
NIST'20 benchmarking details
MassSpecGym benchmarking details
Runtime
Parameter choice
NIST evaluation parameters
Parameter choice
MassSpecGym evaluation parameters
Additional evaluations
Review of candidate sets after longer generation count (NIST'20)
...and 1 more sections

Figures (10)

Figure 1: (a) We view structural elucidation as a special case of the molecular optimization problem common to computer-aided molecular design strategies; our goal is to discover the molecule that best explains the experimental spectrum. A spectrum simulation model [Wang2025-tc] and corresponding spectral similarity objective assume the role of a property prediction oracle. Spectral similarity is a shared pursuit of retrieval-based methods, which mirror virtual screening approaches in molecular design by identifying optimal candidates within a fixed candidate list. Conditional generative models provide one option to propose new molecules de novo for both drug design and structure elucidation [Stravs2022-eo; DBLP:conf/icml/BohdeMWJC25]. Similarly, framing elucidation as an iterative optimization problem enables guidance by the spectral similarity objective in the same manner as iterative molecular optimization. (b) Overview of the FOAM method. Given a target spectrum $T$, seed structures with matching formulae are first collected, for example retrieved from an existing database or proposed by another elucidation tool (Step 1). These structures are fragmented with a spectral simulator, ICEBERG [Goldman2024-nw; Wang2025-tc] (Step 2), and their predicted spectra are compared to the target spectrum to compute spectral similarities (Step 3). These similarities are considered alongside structural complexity (SAScore) [Ertl2009-rq] to calculate the Pareto ranking of candidates via non-dominated sorting. FOAM selects the top-scoring candidates to form the mating pool of parents for the next generation of offspring (Step 4); FOAM then applies formula-constrained crossover and mutation operations to this pool to generate new candidates that maintain the desired formula (Step 5). The process repeats until the termination criterion is reached (Step 6), and the top-k candidates are extracted.
Figure 1: Additional evaluations of FOAM on the MassSpecGym dataset. (a) Histogram of unique samples collected from DiffMS; 100 (possibly overlapping) samples were collected per spectrum. (b) Distribution of simulated spectral similarity for the true structure for spectra in the MassSpecGym dataset where collision energy (eV) was either already annotated (known) or was missing (missing). (c) Comparison of DiffMS and FOAM statistics: encounter rate and the count of close matches in the top 1 and top 10 candidate sets, as defined by [Butler2023-pt]. (d) Comparison of spectral similarity averages and Top 10 match rates for NIST'20 vs MassSpecGym; both are dramatically lower when the collision energy had to be imputed.
Figure 2: Performance of FOAM on a random test subset of NIST'20 as adopted from ICEBERG [Wang2025-tc]. (a) The primary objective function being maximized, the spectral similarity between the given experimental spectrum and the predicted spectrum of a proposed structure, increases steadily over the course of 60 generations. (b) FOAM proposes the true molecule 68% of the time, irrespective of how the structure ranks among the full list of candidates. (c) Structural similarity, when taking only the top-scoring candidates as ranked by predicted spectral similarity, peaks at generation 2 (Best of 1) or 3 (Best of 10) and slowly decays afterward, reflecting the accumulation of competing decoy structures with higher predicted spectral similarity. (d) After only three generations, the true molecule is ranked first 11% of the time and in the top 10 proposals 31% of the time. These rates are upper-bounded by the encounter rates in (b), since the true molecule must be encountered before it can be ranked. Generations are plotted on a symmetric log scale. All metrics are plotted with 99.9% confidence intervals, computed using bootstrapping (n=5,000) estimation of the mean.
Figure 3: Contributions of seed similarity and spectral prediction accuracy to FOAM's performance, as defined by the composition of its top 10 candidate sets. (a) Best structural similarities observed in the top 10 candidate sets after three generations versus accuracy of the predicted spectrum of the true molecule (x-axis) and the similarity of the closest seed structure (2048-bit Morgan fingerprint, Tanimoto similarity) (y-axis). (b) Stratification of test spectra examples by starting seed similarity. Pink distributions show the structural similarity of the top 10 candidate sets after three generations. For visual reference, the seed similarity bins are also represented by corresponding gray range blocks for each density plot. The right-hand column quantifies the appearances of Top 10 Exact Matches in each bin. (c) Stratification of test spectra examples by spectral similarity of the true structure's predicted spectrum to the experimental spectrum. Note that since the density plots show the spread of structural similarity, there are no corresponding gray range blocks to denote the bins. The bottom row quantifies the appearances of Top 10 Exact Matches in each bin. (d) Performance of a logistic regression model (5-fold cross-validation) predicting whether there is a Top 10 Exact Match in the final candidate set. The model achieves 0.841 AUROC using only information that would be available in prospective applications (maximum observed spectral similarity and its corresponding structure's SAScore, adduct type, count of collision energies acquired, generation count, and number of seed structures).
Figure 4: Two example FOAM trajectories and selected candidate structures for selected generations. (a) The molecular formula, adduct, and target spectrum merged over the 12 available collision energies (5%, 9%, 14%, 20%, 30%, 40%, 50%, 65%, 80%, 95%, 110%, 130%). (b) Distribution of objectives of the seed population (Generation 0), as well as the candidate populations after one and four generations. (c) The true structure 1 and the top two structures 2-6 from each generation ranked by spectral similarity. The true structure is found, and ranked first, in Generation 4. (d) The molecular formula, adduct, and target spectrum merged over the 11 available collision energies (7%, 11%, 15%, 19%, 25%, 30%, 36%, 42%, 50%, 57%, and 69%). (e) Distribution of objectives as in (b) but for the search of 7, with Generations 0, 4 and 6 visualized. (f) The true structure 7 and top-ranked candidates by spectral similarity. Generations 0 and 4 display their top two structures 8-11; Generation 6 displays the top structure 12 and 28th structure. The true molecule 7 is found in Generation 6 at rank 28, which is displayed in place of the 2nd ranked molecule for this generation. Molecules are colored by their structural similarity (Tanimoto similarity, 2048-bit Morgan fingerprint) to the true molecule. Gray dashed lines mark the true molecule's ICEBERG-predicted spectral similarity to the experimental spectrum (vertical line) and the molecule's SAScore (horizontal line).
...and 5 more figures
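
The Figure 2 caption reports 99.9% confidence intervals from a bootstrap (n=5,000) estimate of the mean. A minimal sketch of one way to compute such an interval follows. The percentile method, the helper name `bootstrap_mean_ci`, the fixed seed, and the example values are all assumptions; the caption does not state which bootstrap variant was used.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_mean_ci(
    values: Sequence[float],
    n_resamples: int = 5000,
    confidence: float = 0.999,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lo_idx = int((alpha / 2) * (n_resamples - 1))
    hi_idx = int((1 - alpha / 2) * (n_resamples - 1))
    return means[lo_idx], means[hi_idx]


# Hypothetical per-spectrum outcomes: 1.0 if the true molecule was encountered.
encountered = [1.0, 0.0] * 50
low, high = bootstrap_mean_ci(encountered)
print(f"mean={statistics.fmean(encountered):.2f}, 99.9% CI=({low:.2f}, {high:.2f})")
```

With only 5,000 resamples, the 0.05% and 99.95% percentiles rest on a handful of extreme resampled means, so the interval endpoints at this confidence level are themselves somewhat noisy.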
