Accurate Diagnosis of Respiratory Viruses Using an Explainable Machine Learning with Mid-Infrared Biomolecular Fingerprinting of Nasopharyngeal Secretions

Wenwen Zhang, Zhouzhuo Tang, Yingmei Feng, Xia Yu, Qi Jie Wang, Zhiping Lin

TL;DR

This work tackles rapid, multi-virus discrimination using noninvasive mid-infrared spectroscopy of nasopharyngeal secretions augmented by an explainable RoPE-SAT transformer. By preprocessing spectra with SNV normalization and second-order derivatives, augmenting data with Mixup, and applying Grad-CAM for interpretability, the approach achieves over 95% accuracy with high sensitivity and specificity across two cohorts despite differing VTMs and drying protocols. The model highlights biologically meaningful spectral regions corresponding to lipids, proteins (Amide bands), nucleic acids, and carbohydrates, providing mechanistic insight into virus-host spectral signatures. With an ~80% reduction in attention computation and robust performance in varied sample-preparation conditions, this method offers a scalable, on-site screening tool for respiratory viral infections and lays groundwork for broader virus coverage and real-world deployment.
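The Mixup augmentation named above can be sketched as follows. This is a minimal NumPy illustration of standard Mixup (convex combinations of sample pairs and their one-hot labels), not the authors' implementation. The function name `mixup`, the Beta parameter `alpha=0.2`, and the use of a single mixing coefficient per batch are assumptions, since the summary does not state them.

```python
import numpy as np


def mixup(x, y_onehot, alpha=0.2, rng=None):
    """Blend each spectrum with a randomly chosen partner from the same batch.

    x        : (n_samples, n_points) array of preprocessed spectra
    y_onehot : (n_samples, n_classes) array of one-hot labels
    alpha    : Beta(alpha, alpha) shape parameter (assumed value, not from the paper)
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    y_onehot = np.asarray(y_onehot, dtype=float)

    lam = rng.beta(alpha, alpha)      # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))    # partner index for every sample

    # The same convex combination is applied to inputs and to labels.
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam
```

Because the labels are mixed with the same coefficient as the spectra, each augmented label row still sums to 1 and can be used directly with a soft-label cross-entropy loss.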

Abstract

Accurate identification of respiratory viruses (RVs) is critical for outbreak control and public health. This study presents a diagnostic system that combines Attenuated Total Reflectance Fourier Transform Infrared Spectroscopy (ATR-FTIR) of nasopharyngeal secretions with an explainable Rotary Position Embedding-Sparse Attention Transformer (RoPE-SAT) model to accurately identify multiple RVs within 10 minutes. Spectral data (4000-00 cm⁻¹) were collected, and the bio-fingerprint region (1800-900 cm⁻¹) was used for analysis. Standard normal variate (SNV) normalization and second-order differentiation were applied to reduce scattering and baseline drift. Gradient-weighted class activation mapping (Grad-CAM) was used to generate saliency maps that highlight the spectral regions most relevant to classification and make the model outputs easier to interpret. Two independent cohorts from Beijing Youan Hospital, processed with different viral transport media (VTMs) and drying methods, were evaluated: one comprised influenza B, SARS-CoV-2, and healthy controls, and the other comprised mycoplasma, SARS-CoV-2, and healthy controls. The model achieved sensitivity and specificity above 94.40% across both cohorts. By correlating model-selected infrared regions with known biomolecular signatures, we verified that the system recognizes virus-specific spectral fingerprints, including lipids, Amide I, Amide II, Amide III, nucleic acids, and carbohydrates, and leverages their weighted contributions for accurate classification.
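The preprocessing chain described in the abstract (SNV followed by a second-order derivative) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code. The function names are hypothetical, and the derivative uses plain finite differences via `np.gradient` as an assumption; the abstract does not say how the derivative was computed, and a smoothing derivative filter such as Savitzky-Golay is a common alternative in practice.

```python
import numpy as np


def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) by its own
    mean and standard deviation, removing per-sample offset and scale effects."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def second_derivative(spectra, wavenumbers):
    """Second derivative along the wavenumber axis by repeated finite differences.
    Constant and linear baseline terms vanish under this operation."""
    spectra = np.asarray(spectra, dtype=float)
    first = np.gradient(spectra, wavenumbers, axis=1)
    return np.gradient(first, wavenumbers, axis=1)


def preprocess(spectra, wavenumbers):
    """SNV first, then the second derivative, matching the order in the text."""
    return second_derivative(snv(spectra), wavenumbers)
```

SNV cancels a positive multiplicative scale and an additive offset within each spectrum, while the second derivative suppresses slowly varying baselines and sharpens overlapping bands, at the cost of amplifying high-frequency noise.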

Paper Structure

This paper contains 13 sections, 5 equations, 6 figures.

Figures (6)

  • Figure 1: RV identification using the explainable RoPE-SAT model with infrared biomolecular fingerprinting of nasopharyngeal secretions. a. Sample preparation protocol. b. Infrared spectral collection of nasopharyngeal secretions. c. The proposed explainable RoPE-SAT model for RV identification, with interpretability demonstrated by the overlap between model-selected infrared regions and known biomolecular absorption bands. d. Infrared spectral signals collected from two groups of nasopharyngeal swab samples preserved in different VTM solutions. (i) Influenza B, SARS-CoV-2, and healthy controls. (ii) Mycoplasma, SARS-CoV-2, and healthy controls.
  • Figure 2: Characteristic absorption peaks of major biomolecules associated with viral infection, original and preprocessed spectral signals for samples from Cohort 1 and Cohort 2, and marginal PCA-based dimensionality reduction analysis before and after signal preprocessing. (a) Characteristic absorption peaks of major biomolecules associated with viral infection. (b) Original spectral signals for Cohort 1. (c) Spectral signals for Cohort 1 after SNV preprocessing. (d) Spectral signals for Cohort 1 after SNV and second-order derivative preprocessing. (e) Original spectral signals for Cohort 2. (f) Spectral signals for Cohort 2 after SNV preprocessing. (g) Spectral signals for Cohort 2 after SNV and second-order derivative preprocessing. (h) Marginal PCA of the original spectral signals from Cohort 1. (i) Marginal PCA of SNV- and second-derivative-processed spectra from Cohort 1. (j) Marginal PCA of the original spectral signals from Cohort 2. (k) Marginal PCA of SNV- and second-derivative-processed spectra from Cohort 2.
  • Figure 3: The detailed architecture of the proposed explainable RoPE-SAT model for spectral identification.
  • Figure 4: Determination of the overlap ratio between model-selected salient infrared regions and known biomolecular absorption bands using feature importance maps generated by the RoPE-SAT model for classifying influenza B, SARS-CoV-2, and healthy controls. (a)–(c) Overlapping regions between salient infrared ranges (feature importance weight > 0.2) selected by the model and known biomolecular absorption bands, as visualized through feature importance weight maps for (a) influenza B, (b) SARS-CoV-2, and (c) healthy controls. (d) Overlap ratios for each biomolecule. (e) Accuracy curves. (f) Loss curves. (g) Confusion matrix of the classification results. (h) ROC curves and corresponding AUC values.
  • Figure 5: Determination of the overlap ratio between model-selected salient infrared regions and known biomolecular absorption bands using feature importance maps generated by the RoPE-SAT model for classifying mycoplasma, SARS-CoV-2, and healthy controls. (a)–(c) Overlapping regions between salient infrared ranges (feature importance weight > 0.2) selected by the model and known biomolecular absorption bands for (a) mycoplasma, (b) SARS-CoV-2, and (c) healthy controls. (d) Overlap ratios for each biomolecule. (e) Accuracy curves. (f) Loss curves. (g) Confusion matrix of the classification results. (h) ROC curves and corresponding AUC values.
  • ...and 1 more figure
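The overlap ratio referred to in the captions of Figures 4 and 5 can be illustrated with a short sketch. The 0.2 threshold on the feature importance weight comes from the captions; everything else is an assumption made for illustration. The ratio is defined here as the fraction of spectral points inside a band whose importance exceeds the threshold, which may differ from the paper's definition, and the band limits in `BANDS` are approximate textbook ranges rather than values taken from the paper.

```python
import numpy as np

# Illustrative, approximate band limits in cm^-1 (assumed, not from the paper).
BANDS = {
    "Amide I": (1600.0, 1700.0),
    "Amide II": (1500.0, 1600.0),
}


def overlap_ratios(wavenumbers, importance, bands, threshold=0.2):
    """For each band, return the fraction of its spectral points that the model
    marks as salient (importance weight strictly above the threshold)."""
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    salient = np.asarray(importance, dtype=float) > threshold

    ratios = {}
    for name, (low, high) in bands.items():
        in_band = (wavenumbers >= low) & (wavenumbers <= high)
        # A band that lies outside the measured range has no defined ratio.
        ratios[name] = float(salient[in_band].mean()) if in_band.any() else float("nan")
    return ratios
```

A ratio close to 1 for a band would indicate that the model relies on nearly the whole band, and a ratio of 0 that no point within it passes the threshold.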