Data Augmentation Scheme for Raman Spectra with Highly Correlated Annotations

Christoph Lange; Isabel Thiele; Lara Santolin; Sebastian L. Riedel; Maxim Borisyak; Peter Neubauer; M. Nicolas Cruz Bournazou

TL;DR

The paper tackles the challenge of strong annotation correlations in Raman spectroscopy data that hinder generalization across cultivation contexts. It introduces a decorrelation data augmentation scheme that generates uncorrelated labels by solving $\nabla Y = U$ via the SVD of $Y = U \Sigma V^T$ and applying $\nabla$ to spectra to obtain $\nabla X$, with a filtering rule to control noise amplification, complemented by data synthesis from mechanistic models and NMF-based spectral decomposition. In validation on synthetic batch cultivations of R. eutropha, models trained on decorrelated, noise-filtered data demonstrated improved transfer to new substrate mixtures and cultivation modes, compared to baselines. The approach enables reuse of historical Raman spectra for training new models in different process contexts, reducing the need for extensive new experiments and yielding more robust, context-insensitive predictions for PAT applications.
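The decorrelation step summarized above can be illustrated with a minimal NumPy sketch. It is one possible reading of the summary, not the authors' implementation. It assumes that spectra are additive (a linear mix of samples has the same linear mix of labels), that the label matrix of a batch has full column rank, and that the desired labels are drawn independently per column. The names `decorrelate_batch`, `targets`, `coeffs`, and `max_norm`, the Euclidean norm used in the filter, and all demo data are hypothetical.

```python
import numpy as np


def decorrelate_batch(X, Y, targets, max_norm=1.0):
    """Mix the samples of one batch so that the mixed labels equal `targets`.

    X: (n, p) spectra, Y: (n, k) labels, targets: (m, k) desired labels.
    Returns mixed spectra, mixed labels, and the coefficient rows that were kept.
    """
    # Thin SVD of the label matrix: Y = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    # Pseudoinverse from the SVD factors (assumes Y has full column rank)
    Y_pinv = Vt.T @ np.diag(1.0 / s) @ U.T          # shape (k, n)
    # Each row of coeffs is one linear combination of the n batch samples,
    # chosen so that coeffs @ Y == targets
    coeffs = targets @ Y_pinv                        # shape (m, n)
    # Assumed filter: for i.i.d. noise of variance sigma^2 per spectrum, a row c
    # yields noise variance ||c||^2 * sigma^2, so rows with a large norm are dropped
    keep = np.linalg.norm(coeffs, axis=1) <= max_norm
    coeffs = coeffs[keep]
    return coeffs @ X, coeffs @ Y, coeffs


# Hypothetical demo data: 32 samples, 3 strongly correlated labels, 200 spectral bins
rng = np.random.default_rng(0)
n, k, p = 32, 3, 200
t = rng.uniform(0.0, 1.0, size=(n, 1))
Y = np.hstack([t, 1.0 - t, 0.5 * t]) + 0.05 * rng.normal(size=(n, k))
S = rng.uniform(size=(k, p))                         # made-up pure-component spectra
X = Y @ S + 0.01 * rng.normal(size=(n, p))           # additive spectra plus noise
targets = rng.uniform(size=(500, k))                 # independent label columns

X_new, Y_new, coeffs = decorrelate_batch(X, Y, targets)
```

How many rows survive the filter depends on how ill-conditioned the label matrix is: the stronger the correlation between annotations, the larger the coefficients needed to undo it, and the more candidate samples are discarded.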

Abstract

In biotechnology, Raman spectroscopy is rapidly gaining popularity as a process analytical technology (PAT) that measures cell densities as well as substrate and product concentrations. As it records vibrational modes of molecules, it provides that information non-invasively in a single spectrum. Typically, partial least squares (PLS) is the model of choice to infer information about variables of interest from the spectra. However, biological processes are known for their complexity, and here convolutional neural networks (CNN) present a powerful alternative. They can handle non-Gaussian noise and account for beam misalignment, pixel malfunctions or the presence of additional substances. However, they require a lot of data during model training, and they pick up non-linear dependencies in the process variables. In this work, we exploit the additive nature of spectra in order to generate additional data points from a given dataset that have statistically independent labels, so that a network trained on such data exhibits low correlations between the model predictions. We show that training a CNN on these generated data points improves the performance on datasets where the annotations do not bear the same correlation as the dataset that was used for model training. This data augmentation technique enables us to reuse spectra as training data for new contexts that exhibit different correlations. The additional data allows for building a better and more robust model. This is of interest in scenarios where large amounts of historical data are available but are currently not used for model training. We demonstrate the capabilities of the proposed method using synthetic spectra of Ralstonia eutropha batch cultivations to monitor substrate, biomass and polyhydroxyalkanoate (PHA) biopolymer concentrations during the experiments.

Paper Structure (10 sections, 7 equations, 3 figures, 2 tables)

Introduction
Material and Methods
Data Augmentation Scheme
Data Synthesis
Evaluation Setup
Datasets
Model Architecture
Results
Characteristics of the Decorrelation Algorithm
Conclusions

Figures (3)

Figure 1: The fit of the ODE model to the observations of one cultivation. Left: Substrates. Right: Products. RCDW = residual cell dry weight, HB = hydroxybutyrate content of the copolymer.
Figure 2: Normalized spectra generated from the decorrelation algorithm in the training set and unchanged spectra from the validation set.
Figure 3: When filtering out samples whose coefficients have a norm greater than 1, we observe this distribution for batch size $32$.
