
Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications

Saurabh Kaushik, Lalit Maurya, Beth Tellman

Abstract

Geo-Foundation Models (GFMs) have been evaluated on diverse Earth observation tasks spanning multiple domains and have demonstrated strong potential for producing reliable maps even with sparse labels. However, benchmarking GFMs for Cryosphere applications has remained limited, primarily due to the lack of suitable evaluation datasets. To address this gap, we introduce \textbf{Cryo-Bench}, a benchmark compiled to evaluate GFM performance across key Cryospheric components. Cryo-Bench includes debris-covered glaciers, glacial lakes, sea ice, and calving fronts, spanning multiple sensors and broad geographic regions. We evaluate 14 GFMs alongside U-Net and ViT baselines to assess their advantages, limitations, and optimal usage strategies. With a frozen encoder, U-Net achieves the highest average mIoU of \textbf{66.38}, followed by TerraMind at \textbf{64.02}, across the five evaluation datasets included in Cryo-Bench. In the few-shot setting (10\% input data), GFMs such as DOFA and TerraMind outperform U-Net, achieving mIoU scores of \textbf{59.53} and \textbf{56.62}, respectively, compared to U-Net's \textbf{56.60}. Under full fine-tuning, GFM performance is inconsistent across datasets and models. However, tuning the learning rate alongside fine-tuning substantially improves GFM performance. For example, evaluation on two representative datasets (GLID and CaFFe) shows an average relative improvement of \textbf{12.77\%}. Despite having minimal Cryosphere representation in their pretraining data, GFMs exhibit notable domain adaptation capabilities and produce meaningful results across tasks. Based on our findings, we recommend encoder fine-tuning with hyperparameter optimization to achieve the best possible performance, and frozen encoders when users need quick results without extensive experimentation (\href{https://github.com/Sk-2103/Cryo-Bench}{GitHub}).

Paper Structure (14 sections, 8 figures, 10 tables)

This paper contains 14 sections, 8 figures, 10 tables.

Figures (8)

  • Figure 1: The Cryo-Bench dataset covers four major Cryosphere components: debris-covered glaciers, glacial lakes, calving fronts, and sea ice. Cryo-Bench is used to evaluate 14 GFMs against U-Net and ViT baselines.
  • Figure 2: Performance of the top eight GFMs on the Cryo-Bench dataset. (a) TerraMind achieves the highest performance among all GFMs when the encoder is kept frozen. (b) In the few-shot setting, using 10% of the training data, DOFA outperforms all other GFMs. Performance is reported in mIoU.
  • Figure 3: Trade-off between model computation and performance on the (a) GLID and (b) CaFFe datasets. For each model, we report the best mIoU obtained from either the frozen encoder or the fine-tuned encoder, including learning-rate tuning.
  • Figure 4: (a) Comparison of model performance in the few-shot experiment, showing the percentage of full-data performance retained when using only 10% of the labels. This is computed as (average mIoU at 10%) / (average mIoU at 100%) $\times$ 100. (b) Model-specific gains or losses under full fine-tuning relative to the frozen-encoder setting, aggregated across all five evaluation datasets.
  • Figure 5: Net gain in model performance from learning-rate optimization during fine-tuning of the encoder, reported relative to the frozen-encoder setting.
  • ...and 3 more figures
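
The retention metric in the Figure 4(a) caption, (average mIoU at 10%) / (average mIoU at 100%) $\times$ 100, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and `relative_improvement` assumes that a "relative improvement" is the percentage change over a baseline score, which the text does not define explicitly. The numbers in the comments are placeholders, not results from the paper.

```python
from typing import Sequence


def retained_percentage(miou_10pct: Sequence[float], miou_100pct: Sequence[float]) -> float:
    """Percentage of full-data performance retained with 10% of the labels.

    Each argument holds one mIoU value per evaluation dataset; both are
    averaged first, following the formula in the Figure 4(a) caption.
    """
    if not miou_10pct or not miou_100pct:
        raise ValueError("both score lists must be non-empty")
    avg_few_shot = sum(miou_10pct) / len(miou_10pct)
    avg_full = sum(miou_100pct) / len(miou_100pct)
    return avg_few_shot / avg_full * 100.0


def relative_improvement(new_score: float, baseline_score: float) -> float:
    """Percentage change of new_score over baseline_score.

    ASSUMPTION: this definition of relative improvement is not stated
    in the source and is used here for illustration only.
    """
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return (new_score - baseline_score) / baseline_score * 100.0


# Placeholder values (NOT taken from the paper):
# few-shot mIoU of 50 and 60 against full-data mIoU of 80 and 70
# averages to 55 / 75, i.e. roughly 73.3% of performance retained.
example_retention = retained_percentage([50.0, 60.0], [80.0, 70.0])
```

Averaging before dividing, as the caption specifies, weights every dataset equally in the numerator and denominator; averaging per-dataset ratios instead would generally give a different number.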