To Bin or not to Bin: Alternative Representations of Mass Spectra
Niek de Jonge, Justin J. J. van der Hooft, Daniel Probst
TL;DR
Mass spectra are often discretized by binning, which can discard information relevant to molecular-property prediction. The authors propose two non-binned representations—a set-based representation and a graph-based representation—implemented via a SetTransformer and a Graph Attention Network (GAT) and evaluated on a $QED$ regression task, demonstrating that both outperform a baseline MLP trained on binned data, with the graph approach delivering the best performance. They report favorable parameter efficiency for the non-binned methods and provide ready-to-use encoders and code, highlighting the practical viability of these representations. The results suggest graphs enable effective information propagation across peaks, and the approach has potential applicability to other metabolomics tasks such as similarity prediction or structure elucidation.
Abstract
Mass spectrometry, especially so-called tandem mass spectrometry, is commonly used to assess the chemical diversity of samples. The resulting mass fragmentation spectra are representations of molecules of which the structure may have not been determined. This poses the challenge of experimentally determining or computationally predicting molecular structures from mass spectra. An alternative option is to predict molecular properties or molecular similarity directly from spectra. Various methodologies have been proposed to embed mass spectra for further use in machine learning tasks. However, these methodologies require preprocessing of the spectra, which often includes binning or sub-sampling peaks with the main reasoning of creating uniform vector sizes and removing noise. Here, we investigate two alternatives to the binning of mass spectra before down-stream machine learning tasks, namely, set-based and graph-based representations. Comparing the two proposed representations to train a set transformer and a graph neural network on a regression task, respectively, we show that they both perform substantially better than a multilayer perceptron trained on binned data.
