Self-supervised and Multi-fidelity Learning for Extended Predictive Soil Spectroscopy
Luning Sun, José L. Safanelli, Jonathan Sanderman, Katerina Georgiou, Colby Brungard, Kanchan Grover, Bryan G. Hopkins, Shusen Liu, Timo Bremer
TL;DR
This paper addresses the challenge of scalable, accurate soil spectroscopy by developing a self-supervised learning framework that derives a compact $32$-dimensional latent space from MIR spectra. It then creates a bridge to lower-cost NIR measurements by training an NIR encoder that maps into the MIR latent space while keeping the MIR decoder fixed. Predictive models map latent representations (and converted spectra) to nine soil properties, showing that MIR-derived embeddings yield higher accuracy and consistency, while NIR-to-MIR conversion via the latent space often matches or surpasses NIR-only baselines. The approach offers data efficiency, interpretability through latent-feature correlations, and a practical path for deploying portable NIR devices to leverage large MIR datasets in soil health monitoring.
Abstract
We propose a self-supervised machine learning (SSML) framework for multi-fidelity learning and extended predictive soil spectroscopy based on latent space embeddings. A self-supervised representation was pretrained on the large mid-infrared (MIR) spectral library using a variational autoencoder to obtain a compressed latent space for generating spectral embeddings. At this stage, only unlabeled spectral data were used, allowing us to leverage the full spectral database and to use the available scan repeats for augmented training. For a spectrum conversion task, we then froze the trained MIR decoder and attached it to a near-infrared (NIR) encoder, which learns the mapping from NIR to MIR spectra. The aim was to bring the predictive capabilities contained in the large MIR library to a low-cost portable NIR scanner. The NIR encoder was trained on a smaller subset of the KSSL library with paired NIR and MIR spectra. Downstream machine learning models were then trained to map original spectra, predicted spectra, and latent space embeddings to nine soil properties. Performance was evaluated independently of the KSSL training data using a gold-standard test set and regression goodness-of-fit metrics. Compared to baseline models, the proposed SSML framework and its embeddings yielded similar or better accuracy in all soil property prediction tasks. Predictions derived from the spectrum conversion (NIR to MIR) task did not match the performance of the original MIR spectra but were similar or superior to the predictive performance of NIR-only models, suggesting that the unified spectral latent space can effectively leverage the larger and more diverse MIR dataset for predicting soil properties not well represented in current NIR libraries.
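
The frozen-decoder conversion step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it replaces the variational autoencoder with plain linear maps, drops the KL term, and uses synthetic paired data. Only the latent size of 32 comes from the text; the array sizes, the learning rate, and the names `W_dec`, `W_enc`, and `conversion_loss` are assumptions made for illustration. The point it shows is that only the NIR encoder is updated while the MIR decoder stays fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; only LATENT_DIM = 32 is taken from the text.
N_SAMPLES, N_NIR, N_MIR, LATENT_DIM = 200, 40, 60, 32

# Stand-in for the pretrained MIR decoder: a fixed linear map (assumption,
# the paper uses a VAE decoder). It is never updated below.
W_dec = rng.normal(size=(LATENT_DIM, N_MIR)) / np.sqrt(LATENT_DIM)
W_dec_before = W_dec.copy()

# Synthetic paired spectra: both views are generated from a shared latent code.
z_true = rng.normal(size=(N_SAMPLES, LATENT_DIM))
x_mir = z_true @ W_dec
x_nir = (z_true @ rng.normal(size=(LATENT_DIM, N_NIR))) / np.sqrt(LATENT_DIM)

# Trainable NIR encoder, mapping NIR spectra into the MIR latent space.
W_enc = np.zeros((N_NIR, LATENT_DIM))


def conversion_loss(w_enc):
    """Mean squared error between decoded NIR embeddings and paired MIR spectra."""
    x_mir_pred = (x_nir @ w_enc) @ W_dec
    return float(np.mean((x_mir_pred - x_mir) ** 2))


loss_start = conversion_loss(W_enc)

lr = 0.01  # hypothetical step size
for _ in range(500):
    residual = (x_nir @ W_enc) @ W_dec - x_mir
    # Gradient of the squared error with respect to W_enc only
    # (constant factors are folded into lr); W_dec receives no update.
    grad_enc = x_nir.T @ (residual @ W_dec.T) / N_SAMPLES
    W_enc -= lr * grad_enc

loss_end = conversion_loss(W_enc)
print(f"conversion MSE: {loss_start:.4f} -> {loss_end:.4f}")
```

In a deep-learning framework the same idea would usually be expressed by excluding the decoder parameters from the optimizer, so that the NIR embeddings are forced to live in the latent space the MIR decoder already understands.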
