GeoCrossBench: Cross-Band Generalization for Remote Sensing

Hakob Tamazyan, Ani Vanyan, Alvard Barseghyan, Anna Khosrovyan, Evan Shelhamer, Hrant Khachatrian

TL;DR

GeoCrossBench targets cross-band generalization in remote sensing (RS). It extends GeoBench with three evaluation settings (In-Distribution, No-Overlap, and Superset) that span scene classification, semantic segmentation, and change detection. It also introduces ChiViT, a self-supervised ChannelViT baseline trained on a large multi-modal RS pretraining corpus to improve cross-band transfer. The experiments show that state-of-the-art RS foundation models do not consistently outperform general-purpose vision models in-distribution, and that all models degrade substantially when generalizing to new bands in the No-Overlap and Superset settings. Fine-tuning only the last linear layer with oracle labels yields relatively consistent cross-satellite performance, which suggests the benchmark is not yet saturated. The work highlights the need for robust cross-band generalization strategies and provides public code and data to accelerate progress.

Abstract

The number and diversity of remote sensing satellites grow over time, while the vast majority of labeled data comes from older satellites. As foundation models for Earth observation scale up, the cost of (re-)training them to support new satellites grows too, so their ability to generalize to new satellites becomes increasingly important. In this work we introduce GeoCrossBench, an extension of the popular GeoBench benchmark with a new evaluation protocol that tests in-distribution performance, generalization to satellites with no band overlap, and generalization to satellites with additional bands relative to the training set. We also develop ChiViT, a self-supervised extension of ChannelViT, to improve its cross-satellite performance. First, we show that even the best foundation models for remote sensing (DOFA, TerraFM) do not outperform general-purpose models like DINOv3 in the in-distribution setting. Second, when generalizing to new satellites with no band overlap, all models suffer a 2-4x drop in performance, and ChiViT significantly outperforms the runner-up DINOv3. Third, the performance of all tested models drops on average by 5-25% when they are given additional bands at test time. Finally, we show that fine-tuning just the last linear layer of these models using oracle labels from all bands yields relatively consistent performance across all satellites, highlighting that the benchmark is far from saturated. We publicly release the code and the datasets to encourage the development of more future-proof remote sensing models with stronger cross-satellite generalization.

Paper Structure

This paper contains 40 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: The GeoCrossBench evaluation framework. (1) In-Distribution: fine-tune on RGB and evaluate on RGB; fine-tune on full S2 and evaluate on S2. (2) No-Overlap: evaluate transfer from RGB$\to$S1 (VV, VH), RGB$\to$N'S1S2 (B8A, B11, B12) and S2$\to$S1. (3) Superset: RGB$\to$RGBN (RGB+NIR) and S2$\to$S2+S1 (optical+SAR fusion).
  • Figure 2: Overview of the iBOT-style self-distillation pretraining used for $\chi$ViT. Hierarchical channel sampling is applied to create distinct views for the student and teacher, where student channels are a subset of the teacher's channels. Shared projection weights and a shared prediction head are utilized, with losses computed for both CLS and patch tokens.
  • Figure 3: Quick summary of the main results on GeoCrossBench.
  • Figure 4: Performance of the models with a frozen backbone (x-axis) vs. full fine-tuning (y-axis) for all pairs of models and datasets. In panel (a), results are colored by model, and stars indicate each model's average performance. In panel (b), results are colored by dataset, and stars indicate the average performance on each dataset.
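
The captions of Figures 1 and 2 describe two set relationships between bands: the train/test band relationship that defines each evaluation setting, and the student/teacher channel relationship used in pretraining. The sketch below illustrates both under stated assumptions and is not taken from the released code. The band labels, the function names `band_setting` and `sample_hierarchical_channels`, and the uniform choice of subset sizes are all hypothetical; the source only states that student channels are a subset of the teacher's channels.

```python
import random

# Illustrative band labels (assumed; not identifiers from the GeoCrossBench code).
RGB = {"B04", "B03", "B02"}
S1 = {"VV", "VH"}
S2 = RGB | {"B01", "B05", "B06", "B07", "B08", "B8A", "B09", "B11", "B12"}


def band_setting(train_bands, test_bands):
    """Name the evaluation setting implied by a train/test band pair."""
    train, test = set(train_bands), set(test_bands)
    if train == test:
        return "in-distribution"  # same bands at fine-tuning and test time
    if train.isdisjoint(test):
        return "no-overlap"       # no band is shared between train and test
    if train < test:
        return "superset"         # test adds bands on top of the training bands
    return "other"                # partial overlap, not one of the three settings


def sample_hierarchical_channels(channels, rng):
    """Sample teacher channels, then student channels from within them.

    Assumption: subset sizes are drawn uniformly. The only property taken
    from the source is that the student set is a subset of the teacher set.
    """
    channels = list(channels)
    teacher = rng.sample(channels, rng.randint(1, len(channels)))
    student = rng.sample(teacher, rng.randint(1, len(teacher)))
    return teacher, student


if __name__ == "__main__":
    pairs = [
        (RGB, RGB),
        (RGB, S1),
        (RGB, {"B8A", "B11", "B12"}),
        (RGB, RGB | {"B08"}),
        (S2, S2 | S1),
    ]
    for train, test in pairs:
        print(sorted(train), "->", sorted(test), ":", band_setting(train, test))

    teacher, student = sample_hierarchical_channels(sorted(S2 | S1), random.Random(0))
    print("teacher:", teacher)
    print("student:", student)
```

Expressing the settings as set relations makes it easy to check which setting a new train/test band pair belongs to, and the `"other"` branch flags partial overlaps that the three settings listed in Figure 1 do not cover.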