SpecTUS: Spectral Translator for Unknown Structures annotation from EI-MS spectra

Adam Hájek, Helge Hecht, Elliott J. Price, Aleš Křenek

TL;DR

This work proposes SpecTUS: Spectral Translator for Unknown Structures, a deep neural model that addresses the task of structural annotation of small molecules from low-resolution gas chromatography electron ionization mass spectra (GC-EI-MS).

Abstract

Compound identification and structure annotation from mass spectra is a well-established task widely applied in drug detection, criminal forensics, small molecule biomarker discovery and chemical engineering. We propose SpecTUS: Spectral Translator for Unknown Structures, a deep neural model that addresses the task of structural annotation of small molecules from low-resolution gas chromatography electron ionization mass spectra (GC-EI-MS). Our model analyzes the spectra in a de novo manner, translating them directly into a 2D structural representation. This approach is particularly useful for analyzing compounds that are unavailable in spectral libraries. In a rigorous evaluation on the novel structure annotation task across different libraries, our model outperformed standard database search techniques by a wide margin. On a held-out test set, including 28,267 spectra from the NIST database, the model's single suggestion perfectly reconstructs 43% of the subset's compounds. This single suggestion is strictly better than the candidate returned by the database hybrid search (a common method among practitioners) in 76% of cases. In a still affordable scenario of 10 suggestions, perfect reconstruction is achieved in 65% of cases, and the suggestions are better than the hybrid search in 84% of cases.

Paper Structure

This paper contains 10 sections, 6 equations, 16 figures, 9 tables.

Figures (16)

  • Figure 1: Percentage of cases where database search methods (SSS and HSS) successfully retrieved the closest structure from the reference database among the top-1, top-10, and top-50 suggested candidates. Performance is evaluated across all test sets: NIST test split, SWGDRUG, Cayman, and MONA.
  • Figure 2: Comparison of average similarity and accuracy metrics across all tested methods, testing datasets, and three retrieval scenarios (1, 10, and 50 candidates). ST represents SpecTUS, while the abbreviations of the baseline database search methods (SSS, HSS) are explained in the baselines section of the text. Sim$_k$ is displayed by the color bars; the similarity of the theoretical upper bound for database search methods (BDC) is shown as a pink dashed line. $\text{Acc}_k$ values for SpecTUS are shown as black lines; they are inherently zero for all the other methods. The three $\text{Acc}_k$ values in each column correspond to 1, 10, and 50 candidates, displayed from bottom to top.
  • Figure 3: Comparison of the win rate and at-least-as-good rate of SpecTUS over the database search method HSS (A) and a theoretical database search upper bound BDC (B) across all testing datasets and three retrieval scenarios (1, 10, and 50 candidates). The diagram evaluates corresponding retrieval scenarios, such as SpecTUS$_{10}$ versus HSS$_{10}$ ($\text{Win}(\text{SpecTUS}_{10}, \text{HSS}_{10})$), providing a direct comparison of performance under identical conditions.
  • Figure 4: The scatterplot on the left illustrates the Sim$_{10}$ scores for 200 randomly sampled queries from the NIST test set, comparing SpecTUS and HSS predictions. Each point represents a single query, with its position determined by the Sim$_{10}$ scores of SpecTUS (x-axis) and HSS (y-axis). The dashed line indicates where both models achieved identical Sim$_{10}$ values. Notably, 65% of the SpecTUS predictions reached a perfect Tanimoto similarity of 1. Highlighted in red are five specific examples, further detailed in the table on the right. For each example, the ground truth molecule (Label) and the predictions from HSS and SpecTUS are shown, along with their Tanimoto similarity (T) to the label, computed using Morgan fingerprints. For faster comparison, correctly predicted regions are marked with green ellipses, while errors are enclosed in red ellipses. These examples were hand-picked to illustrate typical errors and highlight specific regions of the scatterplot.
  • Figure 5: Overview of the SpecTUS method. The diagram illustrates relationships of all models (blue), datasets (grey) and training stages (green) involved in constructing SpecTUS. It also highlights the final inference process (red), showing how the model transitioned from training to producing ranked molecular predictions.
  • ...and 11 more figures
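
The figure captions refer to Tanimoto similarity, Sim$_k$, $\text{Acc}_k$, and win / at-least-as-good rates without defining them in this excerpt. The following is a minimal Python sketch of how such quantities could be computed, not the authors' implementation. It assumes that fingerprints are given as sets of on-bit indices (in practice these would come from a cheminformatics toolkit computing Morgan fingerprints), that Sim$_k$ for a query is the best similarity among the first k ranked candidates, that a similarity of exactly 1 serves as a proxy for perfect reconstruction, and that a "win" means a strictly higher Sim$_k$. All function names are hypothetical.

```python
from typing import List, Sequence, Set

Fingerprint = Set[int]  # assumption: a fingerprint is a set of on-bit indices


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def sim_at_k(label: Fingerprint, ranked: Sequence[Fingerprint], k: int) -> float:
    """Best similarity to the label among the first k ranked candidates."""
    return max((tanimoto(label, c) for c in ranked[:k]), default=0.0)


def acc_at_k(labels: List[Fingerprint],
             ranked: List[Sequence[Fingerprint]], k: int) -> float:
    """Fraction of queries with a similarity-1 candidate in the top k.

    Similarity 1 is used as a proxy for an exact structure match.
    """
    hits = sum(1 for lab, cands in zip(labels, ranked)
               if sim_at_k(lab, cands, k) == 1.0)
    return hits / len(labels)


def win_rate(labels: List[Fingerprint],
             ranked_a: List[Sequence[Fingerprint]],
             ranked_b: List[Sequence[Fingerprint]],
             k: int, strict: bool = True) -> float:
    """Fraction of queries where method A beats method B on Sim_k.

    strict=True counts only strictly higher scores (win rate);
    strict=False also counts ties (at-least-as-good rate).
    """
    wins = 0
    for lab, a, b in zip(labels, ranked_a, ranked_b):
        sa, sb = sim_at_k(lab, a, k), sim_at_k(lab, b, k)
        if sa > sb or (not strict and sa == sb):
            wins += 1
    return wins / len(labels)


# Toy data: two queries, two ranked candidates per method (made-up bit sets).
labels = [{1, 2, 3}, {4, 5}]
method_a = [[{1, 2, 3}, {1}], [{4}, {4, 5, 6}]]
method_b = [[{1, 2}, {9}], [{4, 5}, {7}]]

print(acc_at_k(labels, method_a, k=1))
print(win_rate(labels, method_a, method_b, k=1))
print(win_rate(labels, method_a, method_b, k=1, strict=False))
```

Comparing both methods at the same k mirrors the caption's pairing of SpecTUS$_{10}$ with HSS$_{10}$; the toy bit sets are illustrative only and carry no chemical meaning.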