StefaLand: An Efficient Geoscience Foundation Model That Improves Dynamic Land-Surface Predictions

Nicholas Kraabel, Jiangtao Liu, Yuchen Bian, Daniel Kifer, Chaopeng Shen

TL;DR

The paper tackles spatial generalization in climate-driven land-surface prediction by introducing StefaLand, a statically grounded, attribute-based spatiotemporal foundation model. StefaLand uses a transformer-based masked autoencoder with cross-variable group masking to learn cross-domain interactions between static landscape attributes and dynamic forcings, followed by lightweight finetuning with residual adapters (StefaLand-resConn) for task-specific predictions. Across five datasets and four task classes (streamflow, soil moisture, soil composition, and landslide susceptibility), the model achieves state-of-the-art or competitive performance and substantially outperforms purely supervised baselines and alternative pretrained representations. It remains computationally efficient, with roughly 720 GPU hours of pretraining and about 12 million parameters. The results highlight the value of cross-domain representations and of attribute-centric pretraining for data-efficient generalization in data-scarce regions, with practical implications for hydrology and geohazard forecasting. Future work is needed to broaden the prediction targets, incorporate image-like data, and add uncertainty quantification.
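To make "cross-variable group masking" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the variable names, the grouping, the mask ratio, and the use of `0.0` as a mask value are all assumptions chosen for illustration. The idea shown is only that whole groups of related variables are hidden together, and that a reconstruction loss is computed on the hidden variables alone.

```python
import random


def group_mask(sample, groups, mask_ratio=0.5, mask_value=0.0, rng=None):
    """Hide every variable belonging to a randomly chosen subset of groups.

    sample: dict of variable name -> float (static) or list of floats (time series).
    groups: dict of group name -> list of variable names (hypothetical grouping).
    Returns (masked copy of the sample, set of hidden variable names).
    """
    rng = rng or random.Random()
    group_names = sorted(groups)
    n_masked = max(1, round(mask_ratio * len(group_names)))
    chosen = rng.sample(group_names, n_masked)
    hidden = {name for g in chosen for name in groups[g]}

    masked = {}
    for name, value in sample.items():
        is_series = isinstance(value, list)
        if name in hidden:
            # Replace the whole variable (all time steps) with the mask value.
            masked[name] = [mask_value] * len(value) if is_series else mask_value
        else:
            masked[name] = list(value) if is_series else value
    return masked, hidden


def masked_mse(prediction, target, hidden):
    """Mean squared error computed only over the hidden variables."""
    errors = []
    for name in hidden:
        p, t = prediction[name], target[name]
        if not isinstance(t, list):
            p, t = [p], [t]
        errors.extend((pi - ti) ** 2 for pi, ti in zip(p, t))
    return sum(errors) / len(errors)


# Hypothetical inputs: static attributes are floats, forcings are short series.
sample = {
    "elevation": 812.0, "slope": 0.12,
    "clay_fraction": 0.31,
    "precip": [0.0, 4.2, 1.1], "temp": [11.5, 9.8, 10.2],
}
groups = {
    "terrain": ["elevation", "slope"],
    "soil": ["clay_fraction"],
    "forcing": ["precip", "temp"],
}
masked, hidden = group_mask(sample, groups, mask_ratio=0.34, rng=random.Random(0))
print(sorted(hidden))
```

In a real masked autoencoder the `prediction` passed to `masked_mse` would come from the decoder. Because an entire group is hidden at once, the model cannot reconstruct a variable from its closest siblings and has to rely on information from other domains, which is the cross-domain interaction the TL;DR refers to.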

Abstract

Managing natural resources and mitigating risks from floods, droughts, wildfires, and landslides require models that can accurately predict climate-driven land-surface responses. Traditional models often struggle with spatial generalization because they are trained or calibrated on limited observations and can degrade under concept drift. Recently proposed vision foundation models trained on satellite imagery demand massive compute, and they are not designed for dynamic land surface prediction tasks. We introduce StefaLand, a generative spatiotemporal Earth representation learning model centered on learning cross-domain interactions to suppress overfitting. StefaLand demonstrates especially strong spatial generalization on five datasets across four important tasks: streamflow, soil moisture, soil composition and landslides, compared to previous state-of-the-art methods. The domain-inspired design choices include a location-aware masked autoencoder that fuses static and time-series inputs, an attribute-based rather than image-based representation that drastically reduces compute demands, and residual fine-tuning adapters that strengthen knowledge transfer across tasks. StefaLand can be pretrained and finetuned on commonly available academic compute resources, yet consistently outperforms state-of-the-art supervised learning baselines, fine-tuned vision foundation models and commercially available embeddings, highlighting the previously overlooked value of cross-domain interactions and providing assistance to data-poor regions of the world.

Paper Structure

This paper contains 47 sections, 19 equations, 6 figures, 25 tables.

Figures (6)

  • Figure 1: Conceptual overview of the StefaLand structure. Static landscape attributes and dynamic forcings are jointly embedded using a transformer-based masked autoencoder with cross-variable group masking. The relevant dimensionalities are indicated.
  • Figure 2: Ablation impact matrix across evaluation settings using RMSE. Each cell shows the percent RMSE increase relative to the full StefaLand-resConn model (lower is better), highlighting the contributions of pretraining and task adaptation. Blank or hatched cells indicate ablations not evaluated for that setting.
  • Figure 3: Adapter ablation on streamflow random spatial splits (PUB). Boxplots summarize per-basin performance distributions for five adapter designs (Feedforward, Bottleneck, MoE, Gated, Residual). We report RMSE (lower is better), correlation (higher is better), and NSE (higher is better) across held-out basins, showing that the Residual adapter yields the most consistent gains, particularly in Corr and NSE.
  • Figure 4: Spatial distribution of the global streamflow dataset. Basins are categorized according to the availability of runoff observations: basins with relatively abundant runoff records (blue), basins with sparse runoff records (orange), and basins without runoff data (green). Marker size corresponds to basin area, classified into three categories based on the 33rd and 67th percentiles of catchment areas.
  • Figure 5: Fine-tuning pipeline used in our downstream experiments. Static attributes and meteorological forcings are encoded by the pretrained StefaLand encoder (frozen), then passed through a task adapter and sequence model (LSTM), followed by projection layers to generate predictions optimized with a task loss against ground truth.
  • ...and 1 more figure
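
The Residual adapter compared in Figure 3 sits between the frozen encoder and the LSTM in the Figure 5 pipeline. The captions do not give its internals, so the sketch below is only one plausible reading: a bottleneck update added back onto the encoder output through a skip connection. The class name, the bottleneck form, the ReLU nonlinearity, and the zero-initialised up-projection are all assumptions, not details taken from the paper.

```python
import random


def relu(x):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in x]


def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ResidualAdapter:
    """Hypothetical adapter computing y = h + W_up @ relu(W_down @ h)."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection: dim -> bottleneck, small random weights.
        self.w_down = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(bottleneck)
        ]
        # Up-projection: bottleneck -> dim, zero-initialised (assumption) so the
        # adapter starts as an identity map over the frozen encoder output.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, h):
        update = matvec(self.w_up, relu(matvec(self.w_down, h)))
        # Skip connection: the pretrained embedding passes through unchanged,
        # and the adapter only contributes a learned correction on top of it.
        return [hi + ui for hi, ui in zip(h, update)]


# Stand-in for one embedding vector produced by the frozen encoder.
embedding = [1.0, -2.0, 0.5, 3.0]
adapter = ResidualAdapter(dim=4, bottleneck=2)
print(adapter(embedding) == embedding)
```

With this form, only `w_down` and `w_up` would be trained during fine-tuning, and the downstream sequence model initially sees the pretrained representation as is. That property is one common motivation for residual adapters, though whether it explains the gains reported in Figure 3 is not something this sketch can show.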