Machine Learning for Raman Spectroscopy-based Cyber-Marine Fish Biochemical Composition Analysis
Yun Zhou, Gang Chen, Bing Xue, Mengjie Zhang, Jeremy S. Rooney, Kirill Lagutin, Andrew MacKenzie, Keith C. Gordon, Daniel P. Killeen
TL;DR
This work tackles the real-time, non-destructive estimation of fish biochemical composition from Raman spectra when data are severely limited. It introduces FishCNN, a large-kernel 1D CNN paired with a rigorously designed preprocessing and augmentation pipeline to enable reliable multi-output regression of water, protein, and lipids yields from ultra-small datasets. The approach outperforms traditional methods and two CNN baselines on FT-Raman and InGaAs 1064 nm data, with statistically significant gains in $R^{2}$ and RMSE across targets, particularly on InGaAs data. The results demonstrate the practical potential for automated, real-time biochemical analysis in the seafood industry and point to future enhancements via Transformer-era architectures and attention mechanisms for improved interpretability and scalability.
Abstract
The rapid and accurate detection of biochemical compositions in fish is a crucial real-world task that facilitates optimal utilization and extraction of high-value products in the seafood industry. Raman spectroscopy provides a promising solution for quickly and non-destructively analyzing the biochemical composition of fish by associating Raman spectra with biochemical reference data using machine learning regression models. This paper investigates different regression models to address this task and proposes a new design of Convolutional Neural Networks (CNNs) for jointly predicting water, protein, and lipids yield. To the best of our knowledge, we are the first to conduct a successful study employing CNNs to analyze the biochemical composition of fish based on a very small Raman spectroscopic dataset. Our approach combines a tailored CNN architecture with the comprehensive data preparation procedure, effectively mitigating the challenges posed by extreme data scarcity. The results demonstrate that our CNN can significantly outperform two state-of-the-art CNN models and multiple traditional machine learning models, paving the way for accurate and automated analysis of fish biochemical composition.
