Determination of Trace Organic Contaminant Concentration via Machine Classification of Surface-Enhanced Raman Spectra
Vishnu Jayaprakash, Jae Bem You, Chiranjeevi Kanike, Jinfeng Liu, Christopher McCallum, Xuehua Zhang
TL;DR
This study tackles the challenge of determining trace concentrations of persistent organic pollutants from noisy SERS data by applying machine-learning classifiers to unprocessed spectra. By transforming spectra with FFT and Walsh-Hadamard methods and training standard ML models, including a CNN with a data augmentation strategy, the authors achieve robust cross-validation accuracies exceeding 80% across three model pollutants, with higher performance on larger, cleaner datasets. The work also connects model-derived peak importances to known characteristic Raman peaks, offering insights for peak identification and robustness to substrate and noise variability. Collectively, the approach demonstrates potential for rapid, in-field concentration estimation of environmental pollutants using SERS coupled with machine learning. The techniques, including transform-based preprocessing and targeted augmentation, are applicable to broader SERS concentration sensing of trace organics.
Abstract
Accurate detection and analysis of traces of persistent organic pollutants in water is important in many areas, including environmental monitoring and food quality control, due to their long-term environmental stability and potential for bioaccumulation. While conventional analysis of organic pollutants requires expensive equipment, surface-enhanced Raman spectroscopy (SERS) has demonstrated great potential for accurate detection of these contaminants. However, analytical difficulties with SERS, such as spectral preprocessing, denoising, and substrate-based spectral variation, have hindered widespread use of the technique. Here, we demonstrate an approach for predicting the concentration of sample pollutants from messy, unprocessed Raman data using machine learning. Frequency domain transform methods, including the Fourier and Walsh-Hadamard transforms, are applied to sets of Raman spectra of three model micropollutants in water (rhodamine 6G, chlorpyrifos, and triclosan), and the transformed spectra are then used to train machine learning algorithms. Using standard machine learning models, the concentration of sample pollutants is predicted with more than 80 percent cross-validation accuracy from raw Raman data. A cross-validation accuracy of 85 percent was achieved using deep learning for a moderately sized dataset (100 spectra), and 70 to 80 percent cross-validation accuracy was achieved even for very small datasets (50 spectra). Additionally, standard models were shown to accurately identify characteristic peaks via analysis of their importance scores. The approach shown here has the potential to facilitate accurate detection and analysis of persistent organic pollutants by surface-enhanced Raman spectroscopy.
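
To make the transform step concrete, the following is a minimal sketch of a fast Walsh-Hadamard transform applied to a single spectrum before it is passed to a classifier. It is an illustration only, not the authors' implementation: the abstract does not state the ordering, normalization, or padding used, so the natural (Hadamard) ordering, the unnormalized output, and the zero-padding to a power of two are all assumptions, as are the function names and the toy intensity values.

```python
def pad_to_power_of_two(values):
    """Zero-pad a sequence so its length is a power of two.

    Assumption: the transform here needs a power-of-two length, and
    zero-padding is one simple way to get it.
    """
    n = 1
    while n < len(values):
        n *= 2
    return list(values) + [0.0] * (n - len(values))


def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform, natural ordering."""
    out = pad_to_power_of_two(values)
    n = len(out)
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h positions apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


# Hypothetical raw spectrum (six intensity values), not data from the study.
spectrum = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]

# Transformed coefficients that could serve as a classifier's feature vector.
features = fwht(spectrum)
```

Because the unnormalized transform is its own inverse up to a factor of the padded length, applying `fwht` to `features` returns the padded spectrum scaled by 8, which gives a quick sanity check on the implementation.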
