AI-driven Generation of MALDI-TOF MS for Microbial Characterization
Lucía Schmidt-Santiago, David Rodríguez-Temporal, Carlos Sevilla-Salcedo, Vanessa Gómez-Verdejo
TL;DR
The paper tackles data scarcity and class imbalance in MALDI-TOF MS-based microbiology by adapting three conditional generative frameworks, MALDIVAE (VAE), MALDIGAN (GAN), and MALDIffusion (diffusion model), to synthesize species-conditioned spectra. Using PIKE-based and diversity metrics, it demonstrates a fidelity-diversity trade-off: MALDIVAE offers the best balance of realistic peak structure and efficiency, MALDIGAN provides competitive diversity, and MALDIffusion yields broader coverage at a high computational cost. Classifiers trained exclusively on synthetic spectra can match real-data performance on standard test sets, and augmenting real training data with synthetic spectra significantly improves recognition of underrepresented species. These findings establish synthetic MALDI-TOF spectra as a practical tool to mitigate data scarcity and domain shift in clinical microbiology, enabling more robust, scalable ML workflows across centers.
Abstract
Matrix-Assisted Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (MALDI-TOF MS) has become a cornerstone technology in clinical microbiology, enabling rapid and accurate microbial identification. However, the development of data-driven diagnostic models remains limited by the lack of sufficiently large, balanced, and standardized spectral datasets. This study investigates the use of deep generative models to synthesize realistic MALDI-TOF MS spectra, aiming to overcome data scarcity and support the development of robust machine learning tools in microbiology. We adapt and evaluate three generative models, Variational Autoencoders (MALDIVAEs), Generative Adversarial Networks (MALDIGANs), and Denoising Diffusion Probabilistic Models (MALDIffusion), for the conditional generation of microbial spectra guided by species labels. Spectral fidelity and diversity are assessed using PIKE-based and diversity metrics. Our experiments show that synthetic data generated by MALDIVAE, MALDIGAN, and MALDIffusion are statistically and diagnostically comparable to real measurements, enabling classifiers trained exclusively on synthetic samples to reach performance levels similar to those trained on real data. While all models faithfully reproduce the peak structure and variability of MALDI-TOF spectra, MALDIffusion obtains this fidelity at a substantially higher computational cost, and MALDIGAN shows competitive but slightly less stable behavior. In contrast, MALDIVAE offers the most favorable balance between realism, stability, and efficiency. Furthermore, augmenting minority species with synthetic spectra markedly improves classification accuracy, effectively mitigating class imbalance and domain mismatch without compromising the authenticity of the generated data.
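The minority-species augmentation described above can be illustrated with a small planning step that decides how many synthetic spectra to request per species. This is a minimal sketch under stated assumptions, not the authors' procedure: the function name `synthetic_quota`, the balancing rule (top every species up to the size of the largest class, or to a chosen target), and the example species counts are hypothetical, and the abstract does not specify how many synthetic spectra are added per species.

```python
from collections import Counter


def synthetic_quota(labels, target=None):
    """Return how many synthetic spectra to generate for each species.

    labels: iterable of species labels, one per real spectrum.
    target: desired number of spectra per species after augmentation.
            Assumption: defaults to the size of the largest real class.
    """
    counts = Counter(labels)
    if not counts:
        return {}
    if target is None:
        target = max(counts.values())
    # Species already at or above the target receive no synthetic spectra.
    return {species: max(0, target - n) for species, n in counts.items()}


# Hypothetical, imbalanced training set (counts chosen for illustration only).
real_labels = ["E. coli"] * 5 + ["K. pneumoniae"] * 2 + ["S. aureus"] * 1
quota = synthetic_quota(real_labels)
```

The resulting per-species quota would then be passed to whichever conditional generator is in use, with the species label as the conditioning input. Balancing to the largest class is only one possible choice; a fixed `target` can be supplied instead.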
