GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction

Eya Cherif, Arthur Ouaknine, Luke A. Brown, Phuong D. Dao, Kyle R. Kovach, Bing Lu, Daniel Mederer, Hannes Feilhauer, Teja Kattenborn, David Rolnick

TL;DR

GreenHyperSpectra addresses the challenge of predicting plant functional traits from hyperspectral data under label scarcity and domain shifts by providing a large, cross-sensor pretraining dataset for semi- and self-supervised regression. The study demonstrates that a masked autoencoder (MAE) pretrained on full-range spectra and fine-tuned for trait prediction (MAE-FR-FT) delivers the strongest performance across both full-range and half-range data, outperforming a fully supervised baseline on $R^2$ and $nRMSE$. MAE-based pretraining also shows robust cross-domain generalization and resilience to sensor noise, underscoring the value of large-scale unlabeled spectral data for ecosystem monitoring. The work provides open access to code and data, enabling reproducibility and future extension to broader cross-domain spectral datasets and more diverse plant traits.
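For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how $R^2$ and $nRMSE$ can be computed per trait for a multi-output regressor. It is not taken from the authors' code. The function names and the trait keys are hypothetical, and normalizing the RMSE by the observed range is an assumption; other conventions (for example, dividing by the mean or by a quantile range) exist, and the paper may use one of them.

```python
from math import sqrt


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def nrmse(y_true, y_pred):
    """RMSE divided by the observed range of y_true.

    ASSUMPTION: range normalization; the source does not state which
    normalization it uses.
    """
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return sqrt(mse) / (max(y_true) - min(y_true))


def multi_trait_scores(true_by_trait, pred_by_trait):
    """Score each trait separately; keys are hypothetical trait names."""
    return {
        name: {
            "r2": r2_score(true_by_trait[name], pred_by_trait[name]),
            "nrmse": nrmse(true_by_trait[name], pred_by_trait[name]),
        }
        for name in true_by_trait
    }


# Toy values for illustration only; these are not data from the paper.
observed = {"trait_a": [1.0, 2.0, 3.0, 4.0]}
perfect = {"trait_a": [1.0, 2.0, 3.0, 4.0]}
mean_only = {"trait_a": [2.5, 2.5, 2.5, 2.5]}

scores_perfect = multi_trait_scores(observed, perfect)
scores_mean = multi_trait_scores(observed, mean_only)
```

A perfect prediction gives $R^2 = 1$ and $nRMSE = 0$, while always predicting the mean gives $R^2 = 0$, which is why higher $R^2$ and lower $nRMSE$ both indicate a better model.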

Abstract

Plant traits such as leaf carbon content and leaf mass are essential variables in the study of biodiversity and climate change. However, conventional field sampling cannot feasibly cover trait variation at ecologically meaningful spatial scales. Machine learning represents a valuable solution for plant trait prediction across ecosystems, leveraging hyperspectral data from remote sensing. Nevertheless, trait prediction from hyperspectral data is challenged by label scarcity and substantial domain shifts (e.g., across sensors, ecological distributions), requiring robust cross-domain methods. Here, we present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples designed to benchmark trait prediction with semi- and self-supervised methods. We adopt an evaluation framework encompassing in-distribution and out-of-distribution scenarios. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models that outperform the state-of-the-art supervised baseline. Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction, establishing a comprehensive methodological framework to catalyze research at the intersection of representation learning and plant functional traits assessment. All code and data are available at: https://github.com/echerif18/HyspectraSSL.

Paper Structure

This paper contains 41 sections, 6 equations, 15 figures, 30 tables.

Figures (15)

  • Figure 1: Overview of the semi- and self-supervised framework for the multi-trait regression task.
  • Figure 2: Spatial coverage of the datasets. Points represent sample locations of GreenHyperSpectra compared to the existing labeled dataset. GreenHyperSpectra data span diverse vegetation types and acquisition conditions.
  • Figure 3: Overview of the semi- and self-supervised frameworks. (\ref{fig:archi_gan}) The semi-supervised regression GAN framework (SR-GAN): the generator maps random noise $z$ to synthetic samples $\hat{x}$, while the discriminator processes fake samples ($x_{\text{fake}}$), unlabeled real samples ($x_{\text{unlb}}$), and labeled real samples ($x_{\text{lb}}$) with associated traits ($y$), optimizing the fake ($L_{\text{fake}}$), unlabeled ($L_{\text{unlb}}$), and labeled ($L_{\text{lb}}$) losses respectively. (\ref{fig:archi_ae_rtm}) The RTM-based autoencoder (RTM-AE) predicts traits from labeled embeddings while reconstructing spectra ($x \to \hat{x}$, $L_{\text{recon}}$). (\ref{fig:archi_mae}) The 1D masked autoencoder framework (1D-MAE) reconstructs masked spectra through tokenization ($L_{\text{recon}}$); the learned representations are then used for trait prediction ($L_{\text{lb}}$). Abbreviations: $x_{\text{fake}}$: fake spectra produced by the generator; $x_{\text{unlb}}$: unlabeled sample from GreenHyperSpectra; $x_{\text{lb}}$: spectral sample from the labeled data; $L_{\text{unlb}}$: unlabeled loss; $L_{\text{lb}}$: labeled loss; $L_{\text{recon}}$: reconstruction loss; $L_{\text{fake}}$: feature contrasting loss; RTM: radiative transfer model; AE: autoencoder; MAE: masked autoencoder.
  • Figure 4: Evaluation of trait prediction with variable-size labeled sets. Validation performance ($R^2$) as a function of the percentage of labeled data used for training. The average $R^2$ performance across all traits is indicated by the dashed box. The higher the $R^2$, the better. For trait abbreviations, see Sec. \ref{sec:LbSection}.
  • Figure 5: Evaluation of trait prediction with variable-size unlabeled sets. Validation performance ($R^2$) as a function of the percentage of unlabeled data used for training. The average $R^2$ performance is indicated by the dashed box. The higher the $R^2$, the better. For trait abbreviations, see Sec. \ref{sec:LbSection}.
  • ...and 10 more figures
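
The Figure 3 caption describes the 1D-MAE as reconstructing masked spectra through tokenization. A minimal sketch of that idea follows, assuming non-overlapping fixed-size patches as tokens, uniform random masking, and a mean-squared reconstruction loss computed over masked tokens only (the common MAE convention). The patch size, mask ratio, function names, and the choice to score only masked tokens are all illustrative assumptions and are not taken from the paper or its repository.

```python
import random


def tokenize(spectrum, patch_size):
    """Split a 1D spectrum into non-overlapping patches (tokens).

    Trailing bands that do not fill a whole patch are dropped.
    """
    num_tokens = len(spectrum) // patch_size
    return [
        spectrum[i * patch_size:(i + 1) * patch_size]
        for i in range(num_tokens)
    ]


def random_mask(num_tokens, mask_ratio, rng):
    """Pick the token indices hidden from the encoder (sorted)."""
    num_masked = int(num_tokens * mask_ratio)
    return sorted(rng.sample(range(num_tokens), num_masked))


def masked_recon_loss(tokens, reconstruction, masked_idx):
    """Mean squared error over the masked tokens only (assumed convention)."""
    squared_errors = [
        (target - estimate) ** 2
        for i in masked_idx
        for target, estimate in zip(tokens[i], reconstruction[i])
    ]
    return sum(squared_errors) / len(squared_errors)


# Toy spectrum with 20 bands; the values are placeholders, not real reflectance.
spectrum = [float(band) for band in range(20)]
tokens = tokenize(spectrum, patch_size=5)  # hypothetical patch size
masked_idx = random_mask(len(tokens), mask_ratio=0.75, rng=random.Random(0))

# Only the visible tokens would be passed to an encoder.
visible_idx = [i for i in range(len(tokens)) if i not in masked_idx]

# A perfect reconstruction yields zero loss; a real decoder would produce
# estimates for the masked tokens instead.
loss_perfect = masked_recon_loss(tokens, tokens, masked_idx)
```

In a real model the encoder sees only the visible tokens and a decoder predicts the masked ones; the pretrained encoder is then reused for trait regression, as the caption notes.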