Rapid Machine Learning-Driven Detection of Pesticides and Dyes Using Raman Spectroscopy

Quach Thi Thai Binh, Thuan Phuoc, Xuan Hai, Thang Bach Phan, Vu Thi Hanh Thu, Nguyen Tuan Hung

TL;DR

The paper addresses the need for rapid, reliable detection of pesticide and dye residues by combining Raman spectroscopy with a CNN-based feature extractor (ResNet-18) and hybrid classifiers. The MLRaman framework converts spectra to 2D spectral images, extracts 512-d embeddings, and classifies ten analytes using XGBoost, SVM, and a VotingClassifier, achieving a best accuracy of 97.4% and an AUC of 1.00 on validation data. Dimensionality-reduction visualizations (PCA, t-SNE, UMAP) confirm strong separability of the embeddings, and a Streamlit app produced real-time predictions on unseen spectra, indicating strong generalization. Overall, the approach provides a scalable, practical solution for multi-residue contaminant monitoring in food safety and environmental surveillance, including deployment-ready tools for real-time decision support.

Abstract

The extensive use of pesticides and synthetic dyes poses critical threats to food safety, human health, and environmental sustainability, necessitating rapid and reliable detection methods. Raman spectroscopy offers molecularly specific fingerprints but suffers from spectral noise, fluorescence background, and band overlap, limiting its real-world applicability. Here, we propose MLRaman, a deep learning framework that combines ResNet-18 feature extraction with advanced classifiers, including XGBoost, SVM, and their hybrid integration, to detect pesticides and dyes from Raman spectra. With the CNN-XGBoost model, MLRaman achieved a predictive accuracy of 97.4% and a perfect AUC of 1.0, while the CNN-SVM model provided competitive results with robust class-wise discrimination. Dimensionality reduction analyses (PCA, t-SNE, UMAP) confirmed the separability of Raman embeddings across 10 analytes, including 7 pesticides and 3 dyes. Finally, we developed a user-friendly Streamlit application for real-time prediction, which successfully identified unseen Raman spectra from both our independent experiments and literature sources, underscoring strong generalization capacity. This study establishes a scalable, practical MLRaman model for multi-residue contaminant monitoring, with significant potential for deployment in food safety and environmental surveillance.
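To make the classification stage concrete, the following is a minimal sketch of a soft-voting ensemble trained on fixed-length embeddings and tuned with GridSearchCV. It is not the authors' code. The embeddings are synthetic stand-ins for the 512-d ResNet-18 features, scikit-learn's GradientBoostingClassifier is swapped in for XGBoost so that the example depends only on scikit-learn, and the PCA step, grid values, and variable names are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for 512-d CNN embeddings of 10 analyte classes
# (assumption: real embeddings would come from the feature extractor).
X, y = make_blobs(n_samples=300, n_features=512, centers=10,
                  cluster_std=5.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Soft voting averages the class probabilities of both base models.
voter = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="rbf", probability=True, random_state=0)),
        # Stand-in for XGBoost (assumption, to avoid an extra dependency).
        ("gb", GradientBoostingClassifier(n_estimators=30, max_depth=2,
                                          random_state=0)),
    ],
    voting="soft",
)

# Scale, compress to 20 components (assumed value), then classify.
model = make_pipeline(StandardScaler(),
                      PCA(n_components=20, random_state=0),
                      voter)

# Small illustrative grid over the SVM regularization strength.
param_grid = {"votingclassifier__svm__C": [1.0, 10.0]}
search = GridSearchCV(model, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

val_accuracy = search.score(X_val, y_val)
print(f"best params: {search.best_params_}")
print(f"validation accuracy: {val_accuracy:.3f}")
```

The synthetic clusters are far easier to separate than real spectra, so the score printed here says nothing about the accuracy reported in the paper. The sketch only shows how the pieces fit together.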

Paper Structure

This paper contains 17 sections, 9 figures.

Figures (9)

  • Figure 1: An integrated pipeline for pesticide identification using Raman spectroscopy and hybrid CNN–machine learning classification. First, Raman spectral data are collected from literature sources or experimental measurements and undergo signal preprocessing, including baseline correction and normalization, to reduce noise and variability. Then, the processed 1D spectra are converted into 2D image representations and resized uniformly to $256\times 256$ pixels. Next, a ResNet-18 model, pretrained on ImageNet and fine-tuned on spectral images, extracts 512-dimensional deep feature representations. These embeddings are used both for dimensionality reduction (via PCA, t-SNE, and UMAP) to explore class separability and for classification using support vector machines (SVM), eXtreme Gradient Boosting (XGBoost), and ensemble VotingClassifier models. Classifier optimization is performed using GridSearchCV. Finally, performance is quantitatively evaluated with accuracy, F1-score, and a confusion matrix.
  • Figure 2: Raman spectra dataset for 10 pesticides and dyes. (a) The Raman spectra of the 10 compounds, including carbendazim (CBZ), carbaryl (CR), thiram (TMTD), thiabendazole (TBZ), rhodamine 6G (R6G), rhodamine B (RB), crystal violet (CV), methyl parathion (MP), cypermethrin (CYP), and chlorpyrifos (CPF), and their molecular formulas. (b) Bar chart showing the distribution of Raman spectral images across 10 chemical classes. The dataset shows class imbalance, with TBZ and TMTD having the largest sample counts. (c) Corresponding pie chart illustrating the proportion of each class within the dataset, reflecting a relatively skewed but comprehensive representation of the target analytes.
  • Figure 3: Comparison between the raw Raman spectrum of thiram (gray), the estimated baseline via ALS (orange dashed), and the preprocessed spectrum after baseline removal and SG smoothing (blue). The pipeline suppresses baseline drift, enhances peak clarity, and reduces high-frequency noise, yielding standardized spectral inputs for machine-learning models.
  • Figure 4: Training and validation performance of the ResNet18 model over 30 epochs, showing the accuracy (a) and loss (b) histories, with validation accuracy plateauing at 80%.
  • Figure 5: Performance evaluation of XGBoost model with the CNN features and the PCA dimensionality reduction, including (a) confusion matrix, (b) ROC curves, and (c) per-class F1-scores, for the validation dataset.
  • ...and 4 more figures
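
The preprocessing described in the Figure 3 caption (ALS baseline estimation, baseline removal, SG smoothing) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It uses the commonly cited asymmetric least squares formulation, and the smoothness `lam`, asymmetry `p`, iteration count, Savitzky-Golay window and polynomial order, and the final min-max normalization are all assumed values. The spectrum is synthetic, with invented peak positions and a linear drift standing in for a fluorescence background.

```python
import numpy as np
from scipy import sparse
from scipy.signal import savgol_filter
from scipy.sparse.linalg import spsolve


def als_baseline(y, lam=1e5, p=0.01, n_iter=10):
    """Estimate a smooth baseline by asymmetric least squares (ALS)."""
    n = len(y)
    # Second-order difference operator, shape (n - 2, n).
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    penalty = lam * (D.T @ D)
    w = np.ones(n)
    z = y
    for _ in range(n_iter):
        W = sparse.diags(w, 0)
        # Solve the weighted, smoothness-penalized least-squares system.
        z = spsolve((W + penalty).tocsc(), w * y)
        # Points above the baseline (peaks) get the small weight p.
        w = np.where(y > z, p, 1.0 - p)
    return z


def preprocess(y, window=11, polyorder=3):
    """Remove the ALS baseline, smooth with Savitzky-Golay, scale to [0, 1]."""
    corrected = y - als_baseline(y)
    smoothed = savgol_filter(corrected, window_length=window,
                             polyorder=polyorder)
    span = smoothed.max() - smoothed.min()
    if span == 0:
        return np.zeros_like(smoothed)
    return (smoothed - smoothed.min()) / span


# Synthetic spectrum (assumption): two Gaussian peaks, drift, and noise.
shift = np.linspace(400.0, 1800.0, 701)  # Raman shift axis in cm^-1
rng = np.random.default_rng(0)
raw = (5.0 * np.exp(-0.5 * ((shift - 1380.0) / 8.0) ** 2)
       + 3.0 * np.exp(-0.5 * ((shift - 560.0) / 8.0) ** 2)
       + 0.002 * (shift - 400.0)
       + 0.05 * rng.normal(size=shift.size))
processed = preprocess(raw)
print("strongest band near", shift[np.argmax(processed)], "cm^-1")
```

A larger `lam` gives a stiffer baseline and a smaller `p` keeps the baseline closer to the lower envelope of the signal. Both would need tuning per instrument and analyte.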