Data Augmentation Scheme for Raman Spectra with Highly Correlated Annotations
Christoph Lange, Isabel Thiele, Lara Santolin, Sebastian L. Riedel, Maxim Borisyak, Peter Neubauer, M. Nicolas Cruz Bournazou
TL;DR
The paper tackles the challenge of strong annotation correlations in Raman spectroscopy data that hinder generalization across cultivation contexts. It introduces a decorrelating data augmentation scheme that generates uncorrelated labels by solving $\nabla Y = U$ via the SVD $Y = U \Sigma V^T$ and applying $\nabla$ to the spectra to obtain $\nabla X$, with a filtering rule to control noise amplification, complemented by data synthesis from mechanistic models and NMF-based spectral decomposition. In validation on synthetic batch cultivations of R. eutropha, models trained on decorrelated, noise-filtered data transferred better to new substrate mixtures and cultivation modes than baselines did. The approach enables reuse of historical Raman spectra for training new models in different process contexts, reducing the need for extensive new experiments and yielding more robust, context-insensitive predictions for PAT applications.
Abstract
In biotechnology, Raman spectroscopy is rapidly gaining popularity as a process analytical technology (PAT) that measures cell densities as well as substrate and product concentrations. As it records vibrational modes of molecules, it provides that information non-invasively in a single spectrum. Typically, partial least squares (PLS) is the model of choice to infer information about variables of interest from the spectra. However, biological processes are known for their complexity, and here convolutional neural networks (CNNs) present a powerful alternative. They can handle non-Gaussian noise and account for beam misalignment, pixel malfunctions or the presence of additional substances. However, they require a lot of data during model training, and they pick up non-linear dependencies in the process variables. In this work, we exploit the additive nature of spectra to generate additional data points from a given dataset that have statistically independent labels, so that a network trained on such data exhibits low correlations between the model predictions. We show that training a CNN on these generated data points improves the performance on datasets whose annotations do not exhibit the same correlations as the dataset that was used for model training. This data augmentation technique enables us to reuse spectra as training data for new contexts that exhibit different correlations. The additional data allows for building a better and more robust model. This is of interest in scenarios where large amounts of historical data are available but are currently not used for model training. We demonstrate the capabilities of the proposed method using synthetic spectra of Ralstonia eutropha batch cultivations to monitor substrate, biomass and polyhydroxyalkanoate (PHA) biopolymer concentrations during the experiments.
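
The decorrelation idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: labels are stored as a samples-by-variables matrix `Y`, spectra as a samples-by-channels matrix `X`, the labels are column-centered before the SVD, and the spectra are a noise-free linear mixture of hypothetical pure-component spectra `S`. The operator written $\nabla$ in the TL;DR is called `D` here, and one closed form satisfying $D Y = U$ is $D = U V \Sigma^{-1} U^T$. The row norms of `D` are shown only as a rough indicator of how much independent measurement noise would be amplified; the paper's actual filtering rule is not reproduced.

```python
import numpy as np

def decorrelation_operator(Y):
    """Return (D, U) such that D @ Yc = U, where Yc is the column-centered Y.

    Assumes Y has shape (n_samples, n_labels) and full column rank after centering.
    """
    Yc = Y - Y.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)  # Yc = U @ diag(s) @ Vt
    D = U @ Vt.T @ np.diag(1.0 / s) @ U.T              # D = U V Sigma^{-1} U^T
    return D, U

rng = np.random.default_rng(0)
n_samples, n_labels, n_channels = 50, 3, 200

# Toy labels with strong correlations (hypothetical data, not from the paper).
base = rng.uniform(0.0, 1.0, size=(n_samples, 1))
Y = np.hstack([base, 1.0 - base, 0.5 * base])
Y = Y + 0.05 * rng.normal(size=(n_samples, n_labels))

# Hypothetical pure-component spectra; spectra are additive in the labels.
S = rng.uniform(0.0, 1.0, size=(n_labels, n_channels))
X = Y @ S

D, U = decorrelation_operator(Y)
Y_new = U        # augmented labels with orthonormal, zero-mean columns
X_new = D @ X    # the same linear operator applied to the spectra

corr_before = np.corrcoef(Y, rowvar=False)
corr_after = np.corrcoef(Y_new, rowvar=False)

# Per-sample noise gain under i.i.d. noise: Euclidean norm of each row of D.
noise_gain = np.linalg.norm(D, axis=1)
```

In this toy setting, `corr_after` is the identity matrix up to floating-point error, and `X_new` equals `U @ S`, so the generated spectra remain consistent with the generated labels. Small singular values of the centered label matrix inflate `noise_gain`, which is why some form of filtering is needed when the spectra carry real measurement noise.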
