When should we trust the annotation? Selective prediction for molecular structure retrieval from mass spectra

Mira Jürgens; Gaetan De Waele; Morteza Rakhshaninejad; Willem Waegeman

TL;DR

We introduce a selective prediction framework for molecular structure retrieval from MS/MS spectra that lets models abstain when uncertainty is too high. By applying distribution-free risk control via generalization bounds, practitioners can specify a tolerable error rate and obtain a subset of annotations that satisfies this constraint with high probability.

Abstract

Machine learning methods for identifying molecular structures from tandem mass spectra (MS/MS) have advanced rapidly, yet current approaches still exhibit significant error rates. In high-stakes applications such as clinical metabolomics and environmental screening, incorrect annotations can have serious consequences, making it essential to determine when a prediction can be trusted. We introduce a selective prediction framework for molecular structure retrieval from MS/MS spectra, enabling models to abstain from predictions when uncertainty is too high. We formulate the problem within the risk-coverage tradeoff framework and comprehensively evaluate uncertainty quantification strategies at two levels of granularity: fingerprint-level uncertainty over predicted molecular fingerprint bits, and retrieval-level uncertainty over candidate rankings. We compare scoring functions including first-order confidence measures, aleatoric and epistemic uncertainty estimates from second-order distributions, as well as distance-based measures in the latent space. All experiments are conducted on the MassSpecGym benchmark. Our analysis reveals that while fingerprint-level uncertainty scores are poor proxies for retrieval success, computationally inexpensive first-order confidence measures and retrieval-level aleatoric uncertainty achieve strong risk-coverage tradeoffs across evaluation settings. We demonstrate that by applying distribution-free risk control via generalization bounds, practitioners can specify a tolerable error rate and obtain a subset of annotations satisfying that constraint with high probability.
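
As an illustration of the risk-coverage tradeoff mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes each test spectrum has a scalar confidence score and a 0/1 loss (for example $1 - \mathrm{Hit@}K$). The function names `risk_coverage_curve` and `aurc` are hypothetical, and the AURC is approximated here as the mean selective risk over all coverage levels, which is one common convention.

```python
from typing import List, Tuple


def risk_coverage_curve(
    confidence: List[float], loss: List[float]
) -> Tuple[List[float], List[float]]:
    """Return (coverages, selective_risks) for accepting the k most
    confident samples, k = 1..n. Higher confidence means more trusted."""
    n = len(confidence)
    # Sort sample indices by descending confidence.
    order = sorted(range(n), key=lambda i: confidence[i], reverse=True)
    coverages, risks = [], []
    cumulative_loss = 0.0
    for k, i in enumerate(order, start=1):
        cumulative_loss += loss[i]
        coverages.append(k / n)            # fraction of accepted samples
        risks.append(cumulative_loss / k)  # mean loss among accepted samples
    return coverages, risks


def aurc(risks: List[float]) -> float:
    """Approximate the area under the risk-coverage curve as the mean
    selective risk over all coverage levels (an assumed convention)."""
    return sum(risks) / len(risks)


# Toy example: four spectra, loss = 1 - Hit@1 (0 means correct top-1 hit).
toy_confidence = [0.9, 0.8, 0.4, 0.2]
toy_loss = [0.0, 0.0, 1.0, 1.0]
coverages, risks = risk_coverage_curve(toy_confidence, toy_loss)
toy_aurc = aurc(risks)
```

A scoring function that ranks correct annotations above incorrect ones, as in the toy data, keeps the selective risk low at small coverage and therefore yields a small AURC.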

Paper Structure (26 sections, 22 equations, 7 figures, 4 tables)

Introduction
Methods
Fingerprint-based molecular retrieval
Selective prediction
Scoring functions
Risk control with statistical guarantees
Evaluation
Experimental setup
Results and discussion
Baseline retrieval performance
Risk-coverage curves
Candidate set size analysis
Coverage under risk control
Conclusion
Uncertainty metrics
...and 11 more sections

Figures (7)

Figure 1: Overview of the selective prediction framework for molecular structure retrieval from tandem mass spectra. (a) Fingerprint-based molecular retrieval. A tandem mass spectrum $\boldsymbol{x}$ is mapped by a model $f$ to a vector of bitwise probabilities $\boldsymbol{\theta}\in [0,1]^{4096}$. Each candidate fingerprint $\boldsymbol{c}_j$ in the instance-specific candidate set $\mathcal{C}_i$ is scored by cosine similarity $s_j = \mathrm{sim}(\boldsymbol{\theta}, \boldsymbol{c}_j)$, and candidates are ranked by descending score. (b) Sources of uncertainty in the learned representation space. Aleatoric uncertainty is due to inherent noise and ambiguity in the data, and can arise for structurally similar molecules. Epistemic uncertainty reflects a general lack of knowledge, and arises in regions of the representation space that are far from training data. (c) Selective prediction. Left: test spectra are sorted by confidence in descending order. A threshold $\tau$ partitions spectra into accepted and rejected predictions. Right: sweeping the threshold $\tau$ over its full range traces the risk-coverage curve, where different points on the curve correspond to different risk-coverage trade-offs.
Figure 2: Average $\mathrm{Hit}@K$ rate for $K\in\{1, 5, 20\}$ for samples from the second-order distribution (grey) and their aggregate (blue, green, purple). Results are shown for a Deep Ensemble of $S=5$ members and the experimental setup as described in Section \ref{sec:experimental-setup}, evaluated on the test set of MassSpecGym with candidates filtered by molecular formula. The aggregate is computed using the different aggregation strategies described in Section \ref{sec:experimental-setup}.
Figure 3: Risk-coverage analysis for different scoring functions $\kappa$ based on different uncertainty estimates as described in Section \ref{sec:scoring-functions}. The selective risk is calculated using $\ell_K = 1 - \mathrm{Hit@}K$ for $K \in \{1, 5, 20\}$. Results are shown for a Deep Ensemble of $S=5$ members and the experimental setup as described in Section \ref{sec:experimental-setup}, evaluated on the test set of MassSpecGym with candidates filtered by molecular formula. Colors encode the uncertainty component: blue tones indicate total uncertainty, purple indicates aleatoric uncertainty, and red tones indicate epistemic uncertainty. Darker shades correspond to retrieval-level scores, lighter shades to fingerprint-level scores. (a) Risk-coverage curves. (b) AURC values.
Figure 4: Effect of the candidate set size on the retrieval performance and selective prediction quality. Results are shown for a Deep Ensemble of $S=5$ members and the experimental setup as described in Section \ref{sec:experimental-setup}, evaluated on the test set of MassSpecGym with candidates filtered by molecular formula. Left: Average hit rate on subsets with binned candidate set size $|\mathcal{C}|$. Middle, Right: AURC values for $\ell_1 = 1 - \mathrm{Hit@}1$ and different scoring functions, on subsets of data with $|\mathcal{C}| < 256$ vs. $|\mathcal{C}| = 256$.
Figure 5: Risk-controlled annotation with the SGR algorithm with $\delta=0.001$ and different target risks $r^*$. Results are shown for a Deep Ensemble of $S=5$ members and the experimental setup as described in Section \ref{sec:experimental-setup}, evaluated on the test set of MassSpecGym with candidates filtered by molecular formula. The scoring functions used for selection are the ones with the strongest risk-coverage trade-offs from Section \ref{sec: risk-coverage analysis}. The threshold $\tau^*$ is calibrated on one half of the test set, and coverage and empirical risk are evaluated on the held-out other half. Top: coverage attained at each target risk level $r^*$ for $K\in\{1,5,20\}$. Note the independent vertical scales. Bottom: empirical risk versus target risk. All points lie below the diagonal, confirming the finite-sample guarantee.
...and 2 more figures
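
The risk-controlled annotation described in the Figure 5 caption can be sketched roughly as follows. This is a simplified illustration in the spirit of SGR, not the paper's exact procedure. It assumes 0/1 losses and distinct confidence values on a calibration set, uses an exact binomial tail bound on the selective risk, and splits $\delta$ evenly over the binary-search steps. The function names `binom_cdf`, `risk_upper_bound`, and `calibrate_threshold` are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P[Binomial(n, p) <= k], summed in log-space for stability."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for j in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(j + 1)
                   - math.lgamma(n - j + 1)
                   + j * math.log(p) + (n - j) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def risk_upper_bound(errors: int, n: int, delta: float,
                     tol: float = 1e-9) -> float:
    """Upper confidence bound b on the true risk: (approximately) the
    smallest b with P[Binomial(n, b) <= errors] <= delta, by bisection."""
    if errors >= n:
        return 1.0
    lo, hi = errors / n, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binom_cdf(errors, n, mid) > delta:
            lo = mid  # bound must be larger
        else:
            hi = mid  # mid already satisfies the tail condition
    return hi


def calibrate_threshold(
    confidence: List[float], loss: List[int],
    target_risk: float, delta: float,
) -> Optional[Tuple[float, float, float]]:
    """Binary search for the largest accepted set whose risk bound stays
    below target_risk. Returns (tau, coverage, bound) or None."""
    m = len(confidence)
    order = sorted(range(m), key=lambda i: confidence[i], reverse=True)
    prefix_errors = [0]  # prefix_errors[k] = errors among top-k samples
    for i in order:
        prefix_errors.append(prefix_errors[-1] + int(loss[i]))
    steps = math.ceil(math.log2(m)) + 1  # assumed union-bound budget
    lo, hi, best = 1, m, None
    for _ in range(steps):
        if lo > hi:
            break
        k = (lo + hi) // 2
        bound = risk_upper_bound(prefix_errors[k], k, delta / steps)
        if bound <= target_risk:
            best = (confidence[order[k - 1]], k / m, bound)
            lo = k + 1  # try to accept more samples
        else:
            hi = k - 1  # too risky, accept fewer samples
    return best


# Toy calibration set: 200 spectra, the 50 least confident are wrong.
calib_conf = [1.0 - i / 200 for i in range(200)]
calib_loss = [0] * 150 + [1] * 50
result = calibrate_threshold(calib_conf, calib_loss,
                             target_risk=0.1, delta=0.001)
```

At test time, a prediction would be accepted only if its confidence is at least the returned threshold; calibrating on one split and evaluating on a held-out split mirrors the setup described in the caption.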
