Sim2Real in Reconstructive Spectroscopy: Deep Learning with Augmented Device-Informed Data Simulation

Jiyi Chen, Pengyu Li, Yutong Wang, Pei-Cheng Ku, Qing Qu

TL;DR

The paper tackles reconstructive spectroscopy under severe training-data constraints by bridging the sim-to-real gap. It introduces a Sim2Real framework that combines Hierarchical Data Augmentation to perturb the device response and a lightweight ReSpecNN network trained entirely on augmented simulated data, enabling fast, accurate spectral reconstruction on real measurements. Empirical results on real-world data show comparable accuracy to NNLS-TV while achieving an order-of-magnitude faster inference, highlighting practical benefits for on-chip, real-time spectroscopy. The work also discusses limitations, such as extreme outliers, and outlines avenues for improving robustness through adversarial augmentation and selective fine-tuning with limited real data. Overall, Sim2Real offers a scalable path to deploy DL-based reconstructive spectroscopy on resource-constrained devices without requiring large real labeled datasets.

Abstract

This work proposes a deep learning (DL)-based framework, namely Sim2Real, for spectral signal reconstruction in reconstructive spectroscopy, focusing on efficient data sampling and fast inference time. It addresses the challenge of reconstructing real-world spectral signals under the extreme setting where only device-informed simulated data are available for training. Such device-informed simulated data are much easier to collect than real-world data but exhibit large distribution shifts from their real-world counterparts. To leverage such simulated data effectively, a hierarchical data augmentation strategy is introduced to mitigate the adverse effects of this domain shift, and a corresponding neural network is designed for spectral signal reconstruction with the augmented data. Experiments using a real dataset measured from our spectrometer device demonstrate that Sim2Real achieves a significant speed-up during inference while attaining performance on par with state-of-the-art optimization-based methods.

Paper Structure

This paper contains 17 sections, 5 equations, 9 figures, 1 table, 1 algorithm.

Figures (9)

  • Figure 1: The diagram of our proof-of-concept Sim2Real framework in reconstructive spectroscopy. The orange arrows denote existing methods, which require collecting and training on real-world data. The blue box and arrows denote our proposed Sim2Real framework, which effectively addresses the domain shift between simulated and real-world data. The domain shift is visualized through PCA in Figure 2. See the Methods section for details. The Response Matrix plot is reprinted with permission from [13]. Copyright 2022 American Chemical Society.
  • Figure 2: The PCA Projection of Simulated and Real Data. The clear separation between clusters along the first two principal components highlights the distribution differences between the simulated data $\bm{y}_{\textsf{sim}}$ and real data $\bm{y}_{\textsf{real}}$, indicating a significant domain shift.
  • Figure 3: The testing RMSE on the simulated and real data [NNLS_TV] for ResCNN [kim2022compressive] and our proposed model ReSpecNN. Both models followed the Sim2Real training setting, that is, they were trained solely on simulated data. Our model further incorporated the hierarchical data augmentation during training, while ResCNN did not.
  • Figure 4: The diagram of the simulated data generation procedure and our proposed hierarchical data augmentation (HDA) scheme. The simulated spectral signal $\bm{x}_{\textsf{sim}}$ is generated as a sum of Lorentzian distributions. For one given $\bm{x}_{\textsf{sim}}$, we generate many corresponding augmented encoded signals $\bm{y}^{(S, T)}_{\textsf{aug}}$ by adding noise $\Delta_{S}$ to $\bm R$ before multiplying with the spectral signal and adding noise $\bm{\epsilon}_T$ afterward. The two noise distributions can be chosen flexibly. This HDA process is summarized in detail in Algorithm 1.
  • Figure 5: The architecture of our proposed neural network. ReSpecNN comprises two fully-connected modules (dubbed rec_fc and rf_fc) and a three-layer convolutional neural network (dubbed conv). Note that rec_fc and rf_fc have a residual connection. For each linear layer in the fully-connected modules, the number above represents its output dimension, where $L$ denotes the number of input wavelengths. For each 1D convolutional layer, the tuple below specifies the number of filters and the kernel size.
  • ...and 4 more figures
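
The data generation and hierarchical data augmentation described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the paper's Algorithm 1: the function names, the Lorentzian parameterization (center, half-width, amplitude), the Gaussian form of both noise terms, the wavelength grid, and the random placeholder response matrix are all assumptions made for the example. The caption only states that noise $\Delta_{S}$ is added to $\bm R$ before multiplication, that noise $\bm{\epsilon}_T$ is added afterward, and that both distributions can be chosen flexibly.

```python
import random


def lorentzian_spectrum(wavelengths, peaks):
    """Simulated spectrum x_sim as a sum of Lorentzian peaks.

    Each peak is (center, half_width, amplitude); this parameterization
    is an assumption, not taken from the paper.
    """
    return [
        sum(a * g ** 2 / ((w - c) ** 2 + g ** 2) for c, g, a in peaks)
        for w in wavelengths
    ]


def encode(R, x):
    """Encoded signal y = R x, with R stored as a list of rows."""
    return [sum(r * v for r, v in zip(row, x)) for row in R]


def augment(R, x_sim, n_S, n_T, sigma_S, sigma_T, rng):
    """Two-level augmentation for one spectrum x_sim.

    Outer level: perturb the response matrix (R + Delta_S).
    Inner level: add measurement noise eps_T to the encoded signal.
    Gaussian noise is an assumed choice for both levels.
    """
    samples = []
    for _ in range(n_S):
        R_pert = [[r + rng.gauss(0.0, sigma_S) for r in row] for row in R]
        y_clean = encode(R_pert, x_sim)
        for _ in range(n_T):
            y_aug = [y + rng.gauss(0.0, sigma_T) for y in y_clean]
            samples.append((y_aug, x_sim))
    return samples


# Hypothetical usage: 31 wavelengths, 8 detector channels.
rng = random.Random(0)
wavelengths = [400.0 + 10.0 * i for i in range(31)]
x_sim = lorentzian_spectrum(wavelengths, [(500.0, 15.0, 1.0), (620.0, 25.0, 0.6)])
R = [[rng.random() for _ in wavelengths] for _ in range(8)]  # placeholder, not a measured response
pairs = augment(R, x_sim, n_S=3, n_T=4, sigma_S=0.01, sigma_T=0.005, rng=rng)
print(len(pairs))  # n_S * n_T training pairs for this single spectrum
```

Each simulated spectrum yields `n_S * n_T` pairs `(y_aug, x_sim)` that share the same target, so a network trained on them sees many plausible device responses and noise realizations for one ground-truth spectrum.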